Metadata Modelling

Keep in mind that the models are trained as parent-node classifiers within the cascading taxonomic structure.
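As a rough illustration of what "parent-node classifiers within a cascading structure" can look like at prediction time, the sketch below walks down a taxonomy by asking each parent node's classifier which child to descend into. The function name `cascade_predict`, the node labels, and the feature keys are all hypothetical; the lambdas stand in for trained models and are not the project's actual classifiers.

```python
from typing import Callable, Dict, List

# Hypothetical type: a parent-node classifier maps a feature dict to a child label.
NodeClassifier = Callable[[dict], str]


def cascade_predict(
    features: dict,
    classifiers: Dict[str, NodeClassifier],
    root: str = "root",
) -> List[str]:
    """Return the path of predicted child nodes, starting below the root."""
    path: List[str] = []
    node = root
    # Descend until we reach a node that has no classifier (a leaf).
    while node in classifiers:
        node = classifiers[node](features)
        path.append(node)
    return path


# Toy stand-ins for trained models, keyed by the parent node they belong to.
classifiers: Dict[str, NodeClassifier] = {
    "root": lambda f: "group_a" if f["elevation"] > 1000 else "group_b",
    "group_a": lambda f: "group_a1" if f["month"] < 6 else "group_a2",
}

path = cascade_predict({"elevation": 1500, "month": 3}, classifiers)
```

Here only `root` and `group_a` have classifiers, so a prediction that lands in `group_b` stops after one step.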

The metadata modelling process is carried out by the scripts below:

Pipeline

This file performs data cleaning, transformation, and structuring for use within the metadata models.

This file performs all metadata classification model training. Specifically, it trains metadata classifiers at every taxonomic level for each of the proposed models. This produces five complete cascading taxonomic classifiers, which are compared at each taxonomic level to determine the most robust metadata classifier. For the model comparison, see notebooks/meta_modelling/meta_data_model_comparison.ipynb.

Neural Network

The neural network metadata model training and evaluation process. Hyperparameter tuning involves determining the optimal learning rate for the model, which is needed because different taxonomic levels generate varying levels of abstraction.

Random Forest

The Random Forest metadata model training and evaluation process. Hyperparameter tuning involves determining the optimal tree depth for the estimators within the ensemble method.

XGBoost

The XGBoost metadata model training and evaluation process. Hyperparameter tuning involves determining the optimal tree depth within the XGBoost model.

Decision Tree

The decision tree metadata model training and evaluation process. Hyperparameter tuning involves determining the optimal decision tree depth for the model.
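A depth search of this kind might be sketched with scikit-learn's `GridSearchCV`, as below. This is a minimal example under stated assumptions, not the repository's training code: the synthetic dataset stands in for the metadata features at a single parent node, and the candidate depths, fold count, and scoring metric are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the metadata features and child labels at one parent node.
X, y = make_classification(
    n_samples=300,
    n_features=8,
    n_informative=4,
    n_classes=3,
    random_state=0,
)

# Candidate depths are illustrative; None lets the tree grow until leaves are pure.
candidate_depths = [2, 4, 6, 8, None]

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": candidate_depths},
    cv=5,                # 5-fold cross-validation (an assumed setting)
    scoring="accuracy",  # an assumed metric
)
search.fit(X, y)

best_depth = search.best_params_["max_depth"]
best_score = search.best_score_
```

In a cascading setup, a search like this would be repeated for each parent node, since the best depth may differ between taxonomic levels.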

AdaBoost

The AdaBoost metadata model training and evaluation process. Hyperparameter tuning involves determining the optimal number of estimators to be used within the ensemble model.

K-means Silhouette Score Automation

The silhouette score provides a way to automate the selection of the number of centroids for the K-means clustering algorithm. It was used to determine the optimal number of centroids for capturing the geographic location distribution at each parent node, in order to create a useful location encoding within the data.
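The selection described above can be sketched as follows: fit K-means for a range of candidate centroid counts and keep the count with the highest mean silhouette score. The helper name `best_k_by_silhouette`, the candidate range, and the synthetic coordinates are assumptions for illustration. The sketch also treats locations as plain 2D points with Euclidean distance, which may differ from how the project handles geographic coordinates.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def best_k_by_silhouette(points, k_values):
    """Return (best_k, scores), where scores maps each k to its mean silhouette score."""
    scores = {}
    for k in k_values:
        # Fixed random_state keeps the clustering reproducible across runs.
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(points)
        scores[k] = silhouette_score(points, labels)
    best_k = max(scores, key=scores.get)
    return best_k, scores


# Synthetic stand-in for the locations observed at one parent node:
# three tight, well-separated groups of 50 points each.
rng = np.random.default_rng(0)
centres = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
points = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centres])

# The silhouette score is only defined for 2 <= k <= n_samples - 1.
best_k, scores = best_k_by_silhouette(points, range(2, 7))

# One possible location encoding: the index of the cluster each point falls in.
location_codes = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit_predict(points)
```

Using the cluster index as a categorical feature is one way to turn raw coordinates into a location encoding; distances to each centroid would be another option.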