Metadata Modelling
Keep in mind that the models are trained as parent-node classifiers within the cascading taxonomic structure.
The metadata modelling process is carried out by the scripts below:
Pipeline
This file performs data cleaning, transformation, and structuring for use within the metadata models.
Model Training
This file performs all metadata classification model training.
Specifically, this file performs metadata classification training at all taxonomic levels across all proposed models.
This results in five complete cascading taxonomic classifiers, which are compared at each taxonomic level to determine the most robust
metadata classifier. For the model comparison, please review notebooks/meta_modelling/meta_data_model_comparison.ipynb
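As a rough illustration of what a cascading set of parent-node classifiers might look like, the sketch below trains one classifier at the root and one per parent node, then chains their predictions. Everything here is an assumption made for illustration: the two-level taxonomy, the names (family_a, genus_a1, predict_cascade), the synthetic features, and the choice of a decision tree. It is not the repository's implementation.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Hypothetical two-level taxonomy: each family (parent) owns its genera (children).
taxonomy = {
    "family_a": ["genus_a1", "genus_a2"],
    "family_b": ["genus_b1", "genus_b2"],
}

# Synthetic metadata features: one well-separated cluster per genus.
centres = {
    "genus_a1": (0, 0), "genus_a2": (0, 5),
    "genus_b1": (10, 0), "genus_b2": (10, 5),
}
X, families, genera = [], [], []
for family, children in taxonomy.items():
    for genus in children:
        X.append(rng.normal(centres[genus], 0.5, size=(50, 2)))
        families += [family] * 50
        genera += [genus] * 50
X = np.vstack(X)
families = np.array(families)
genera = np.array(genera)

# Root classifier: predicts the family.
root_clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, families)

# One classifier per parent node, trained only on that parent's samples.
node_clfs = {}
for family in taxonomy:
    mask = families == family
    node_clfs[family] = DecisionTreeClassifier(max_depth=3, random_state=0).fit(
        X[mask], genera[mask]
    )

def predict_cascade(x):
    """Predict the family first, then hand the sample to that family's classifier."""
    x = np.asarray(x, dtype=float).reshape(1, -1)
    family = root_clf.predict(x)[0]
    genus = node_clfs[family].predict(x)[0]
    return family, genus
```

Calling predict_cascade([10, 5]) walks the sample down the hierarchy one level at a time, so an error at the family level propagates to the genus level. That is one reason comparing models at each taxonomic level is informative.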
Neural Network
The neural network metadata model training and evaluation process. Hyperparameter tuning involves determining the optimal learning rate for the model, because the level of abstraction varies between taxonomic levels.
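A learning-rate search of this kind could be sketched as below. The source does not say which framework or search procedure is used, so this example assumes scikit-learn's MLPClassifier, a hold-out validation split, synthetic data, and an arbitrary list of candidate rates.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the metadata features at one taxonomic level.
X, y = make_classification(
    n_samples=400, n_features=10, n_informative=5, n_classes=3, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Assumed candidate learning rates; the real search space is not given in the source.
candidate_rates = [1e-4, 1e-3, 1e-2, 1e-1]

val_scores = {}
for lr in candidate_rates:
    clf = MLPClassifier(
        hidden_layer_sizes=(32,), learning_rate_init=lr, max_iter=300, random_state=0
    )
    clf.fit(X_train, y_train)
    val_scores[lr] = clf.score(X_val, y_val)  # validation accuracy

# Keep the rate with the highest validation accuracy.
best_lr = max(val_scores, key=val_scores.get)
```

Repeating the search per taxonomic level would allow each level to settle on its own rate.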
Random Forest
The Random Forest metadata model training and evaluation process. Hyperparameter tuning involves determining the optimal tree depth for the estimators within the ensemble method.
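One common way to tune tree depth is a cross-validated grid search, sketched below. The depth grid, the number of estimators, the fold count, and the synthetic data are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the metadata features.
X, y = make_classification(n_samples=300, n_features=10, n_informative=5, random_state=0)

# Assumed depth grid; None lets each tree grow until its leaves are pure.
param_grid = {"max_depth": [2, 4, 8, None]}

search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X, y)

best_depth = search.best_params_["max_depth"]
```

After fitting, search.best_estimator_ holds a forest refit on all of the data using the selected depth.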
XGBoost
The XGBoost metadata model training and evaluation process. Hyperparameter tuning involves determining the optimal tree depth within the XGBoost model.
Decision Tree
The decision tree metadata model training and evaluation process. Hyperparameter tuning involves determining the optimal decision tree depth for the model.
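For a single tree, it can be useful to look at training and validation accuracy side by side as depth grows, since deep trees tend to overfit. The sketch below uses scikit-learn's validation_curve for that purpose. The depth range, fold count, and synthetic data are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the metadata features.
X, y = make_classification(n_samples=300, n_features=10, n_informative=5, random_state=0)

depths = list(range(1, 11))  # assumed search range
train_scores, val_scores = validation_curve(
    DecisionTreeClassifier(random_state=0),
    X,
    y,
    param_name="max_depth",
    param_range=depths,
    cv=5,
)

# Average over folds, then pick the depth with the best validation accuracy.
mean_val = val_scores.mean(axis=1)
best_depth = depths[int(np.argmax(mean_val))]
```

A widening gap between train_scores and val_scores at larger depths is the usual sign that the tree has started to memorise the training folds.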
AdaBoost
The AdaBoost metadata model training and evaluation process. Hyperparameter tuning involves determining the optimal number of estimators to be used within the ensemble model.
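Because boosting adds estimators sequentially, the number of estimators can be evaluated from a single fit using scikit-learn's staged_score, which yields the score after each boosting round. The sketch below assumes a hold-out validation split, an upper limit of 100 estimators, and synthetic data. The source does not state how the search is actually performed.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the metadata features.
X, y = make_classification(n_samples=400, n_features=10, n_informative=5, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

max_estimators = 100  # assumed upper limit for the search
clf = AdaBoostClassifier(n_estimators=max_estimators, random_state=0)
clf.fit(X_train, y_train)

# Validation accuracy after 1, 2, ..., n boosting rounds (boosting may stop early).
staged = np.array(list(clf.staged_score(X_val, y_val)))

# Smallest ensemble size that reaches the best validation accuracy.
best_n_estimators = int(np.argmax(staged)) + 1
```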
K-means Silhouette Score Automation
The silhouette score is used to automate the selection of the number of centroids for a K-means clustering algorithm. It was used to determine the optimal number of centroids needed to capture the geographic location distribution at each parent node, creating a useful location encoding within the data.
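The selection procedure can be sketched as follows: fit K-means for each candidate number of centroids, score each clustering with the mean silhouette coefficient, and keep the best. The coordinates, the candidate range, and the helper name best_k_by_silhouette are hypothetical. Treating latitude and longitude as Euclidean coordinates is also a simplifying assumption.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

# Hypothetical (latitude, longitude) observations scattered around three locations.
centres = np.array([[-33.9, 151.2], [-37.8, 145.0], [-27.5, 153.0]])
coords = np.vstack([rng.normal(c, 0.2, size=(60, 2)) for c in centres])

def best_k_by_silhouette(points, k_values):
    """Return the k with the highest mean silhouette score, plus all scores."""
    scores = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(points)
        scores[k] = silhouette_score(points, labels)
    best_k = max(scores, key=scores.get)
    return best_k, scores

# The silhouette score is undefined for a single cluster, so the search starts at 2.
best_k, scores = best_k_by_silhouette(coords, range(2, 7))

# Refit with the selected k; the cluster index serves as the location encoding.
kmeans = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit(coords)
location_code = kmeans.predict(coords)
```

Running this once per parent node would give each node its own number of centroids, and location_code could then be added as a categorical feature for that node's classifier.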