
Metadata Model Training

This module performs metadata model training for all proposed models at all taxonomic levels.

The metadata model training is automated to train each of the proposed models (Decision Tree, Random Forest, AdaBoost, XGBoost, and a Neural Network) at each taxonomic parent node of the dataset. This forms five cascading taxonomic classifiers, one model per taxonomic parent node, and enables comparison of the models at each taxonomic level to determine the most robust and optimal model to use as a metadata classifier.
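A minimal usage sketch (the import path is assumed; the dataset file names are those used by train_base_model below):

```python
# Assumed import path for this module.
from src.models.meta.model_training import dataset_iteration

# Train all five model types at every available taxonomic level, one dataset at a time.
dataset_iteration('felids_train.csv', 'felids_meta.csv')
dataset_iteration('proboscidia_train.csv', 'proboscidia_meta.csv')
```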

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| model_abbreviations | dict | A dictionary containing the names of the classification models as keys, and their abbreviations as values. |
| model_save_types | dict | A dictionary containing the names of the classification models as keys, and their relevant file types when saved. |
| file_name_taxon | dict | A dictionary containing the taxonomic level indicators in the dataset, and their relevant abbreviations to be used in file naming. |
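For orientation, a hypothetical sketch of these dictionaries' shape (the keys match the model labels dispatched in model_selection_execution; the values shown are assumptions, not the module's actual entries):

```python
# Hypothetical values for illustration; the real entries are defined in
# src/models/meta/model_training.py.
model_abbreviations = {'Neural network': 'nn', 'Decision tree': 'dt',
                       'Random forest': 'rf', 'Xgboost': 'xgb', 'AdaBoost': 'ab'}
model_save_types = {'Neural network': '.h5', 'Decision tree': '.joblib',
                    'Random forest': '.joblib', 'Xgboost': '.json', 'AdaBoost': '.joblib'}
file_name_taxon = {'taxon_family_name': 'family', 'taxon_genus_name': 'genus',
                   'taxon_species_name': 'species', 'sub_species': 'subspecies'}
```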

dataset_iteration(observation_file, metadata_file)

This method performs the full metadata training for all models at all available taxonomic levels for the provided dataset. Only a single dataset is trained at a time.

The printed information is used within model_comparison.ipynb to direct the model validation and figure construction. For more information, please review the model_comparison notebook.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| observation_file | str | The name of the observation file within the data/processed/ directory. | required |
| metadata_file | str | The name of the metadata file within the data/processed/ directory. Note that it must correspond to the observation file. | required |
Source code in src/models/meta/model_training.py
def dataset_iteration(observation_file: str, metadata_file: str):
    """This method is performs the full metadata training for all models at all available taxonomic levels for the
    provided dataset. Only a single dataset is trained at a time


    The information printed out, is to be used within the `model_comparison.ipynb` to direct the model validation and figure construction.
    For more information, please review the `model_comparison` notebook.

    Args:
        observation_file (str): The name of the observation files within the `data/processed/` directory.
        metadata_file (str): The name of the metadata files within the `data/processed/` directory. Note they must coincide with the order of the observation files.
    """
    models, model_name_collection, taxon_target_collection, abbreviations_collection = model_iteration(observation_file, metadata_file)

    # Information to be used in the model_comparison.ipynb notebook
    print('Models: ', models)
    print('File names: ', model_name_collection)
    print('Taxon targets: ', taxon_target_collection)
    print('Model abbreviations: ', abbreviations_collection)

generate_file_name_start(restriction)

Method standardizes the parent taxonomic restriction label to create a suitable filename for each model.

This method removes whitespace, replacing it with an underscore, and ensures the name is all lower case.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| restriction | str | The label of the taxonomic parent node (restriction). | required |

Returns:

| Type | Description |
| --- | --- |
| str | A standardized form of the restriction. |

Source code in src/models/meta/model_training.py
def generate_file_name_start(restriction: str):
    """Method standardizes the parent taxonomic restriction to create a suitable filename for each model

        This method removes white space, replacing it with an underscore, and ensures the name is all lower case.
        Args:
            restriction (str): The label of the taxonomic parent node (restriction)

        Returns:
            (str): A standardized form of the restriction.
    """
    restriction = restriction.replace(" ", "_")
    restriction = restriction.lower()
    return restriction
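For example, a parent-node label is standardized as follows:

```python
# Whitespace becomes underscores and the label is lower-cased.
generate_file_name_start('Loxodonta africana')  # -> 'loxodonta_africana'
```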

model_iteration(observation_file, metadata_file)

Method performs the iteration over all proposed models for a single dataset.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| observation_file | str | The processed iNaturalist observations dataset. | required |
| metadata_file | str | The corresponding metadata for the observation file. | required |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| models | list | The list of all models iterated over. |
| model_name_collection | list | The list of all model names produced during the iteration of the dataset. |
| taxon_target_collection | list | The list of all taxonomic targets iterated through within this dataset. |
| abbreviations_collection | list | A list of the corresponding model abbreviations. |

Source code in src/models/meta/model_training.py
def model_iteration(observation_file: str, metadata_file: str):
    """Method performs the model iteration per dataset.

    Args:
        observation_file (str): The processed iNaturalist observations dataset.
        metadata_file (str): The corresponding metadata for the observation file.

    Returns:
        models (list): The list of all models iterated over.
        model_name_collection (list): The list of all model names produced during the iteration of the dataset.
        taxon_target_collection (list): The list of all taxonomic targets iterated through within this dataset.
        abbreviations_collection (list): A list of the corresponding model abbreviations.
    """
    # Multi-model collections
    models = list(model_abbreviations.keys())

    model_name_collection = []  # Collections across multiple model collections (this is the same for all models)
    taxon_target_collection = []
    abbreviations_collection = []

    taxon_models = []  # Collections across single models
    taxon_targets = []

    for model in models:  # Iterate across all models
        taxon_models, taxon_targets = taxonomic_level_modelling(observation_file, metadata_file, model)  # Iterate across all taxonomic levels per model

        abbreviations_collection.append(model_abbreviations[model])  # Collect essential information regarding file names for future use

    model_name_collection.append(taxon_models)  # Collect model file names which indicate taxonomic level (supplemented by model abbreviations)
    taxon_target_collection.append(taxon_targets)  # Collect taxonomic target levels

    return models, model_name_collection, taxon_target_collection, abbreviations_collection  # Return gathered file information
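The returned collections can be unpacked as below. Note that the model file-name stems and taxon targets are identical across model types, so each is appended once; names[0][i] pairs with targets[0][i] (a sketch reusing the felid file names from train_base_model):

```python
models, names, targets, abbrs = model_iteration('felids_train.csv', 'felids_meta.csv')

# models  -> the model-type labels, e.g. ['Neural network', 'Decision tree', ...]
# names   -> [[file-name stems, one per trained parent node]]
# targets -> [[taxonomic target levels, aligned with the stems above]]
# abbrs   -> one abbreviation per model type, aligned with `models`
```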

model_selection_execution(model, df, target_taxon, model_name, training_history, validation_file)

This method allows multiple models to be trained by specifying the model type and then executing the required data pipeline.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| model | str | Specification of the model to be used to classify the taxonomic child nodes. | required |
| df | DataFrame | The combined observation and metadata dataframe with the taxonomic parent node restriction applied. Only taxonomic child labels are present in df. | required |
| target_taxon | str | Specification of the taxonomic level of the taxon child nodes (not the taxonomic level of the parent node). | required |
| model_name | str | The complete model name (the parent taxon label and model abbreviation make the combined name unique). | required |
| training_history | str | File name at which to save the model training history. | required |
| validation_file | str | File name at which to save the validation dataset. | required |

Returns:

| Type | Description |
| --- | --- |
| None | This method returns nothing; the return statements simply end execution after the matched case. |

Source code in src/models/meta/model_training.py
def model_selection_execution(model: str,
                              df: pd.DataFrame,
                              target_taxon: str,
                              model_name: str,
                              training_history: str,
                              validation_file: str):
    """This method allows multiple models to be trained through the specification of the model type, and the subsequent execution of the required data pipeline.

    Args:
        model (str): Specification of the model to be used to classify the taxonomic child nodes.
        df (DataFrame): The combined observation and metadata dataframe with the taxonomic parent node restriction applied. Only taxonomic child labels are present in df.
        target_taxon (str): Specification of the taxonomic level of the taxon child nodes (not the taxonomic level of the parent node)
        model_name (str): The complete model name (parent taxon label and model abbreviation make the combined name unique)
        training_history (str): File name at which to save the model training history.
        validation_file (str): File name at which to save the validation dataset.
    Returns:
        (None): This method returns nothing. The return statement was used to ensure case stopping.
    """
    match model:
        case 'Neural network':
            return neural_network_model.neural_network_process(df, target_taxon, model_name,
                                                               training_history, validation_file)
        case 'Decision tree':
            return decision_tree.decision_tree_process(df, target_taxon, model_name, training_history,
                                                       validation_file)
        case 'Random forest':
            return random_forest.random_forest_process(df, target_taxon, model_name, training_history,
                                                       validation_file)
        case 'Xgboost':
            return xgboost_model.xgboost_process(df, target_taxon, model_name, training_history,
                                                 validation_file)
        case 'AdaBoost':
            return adaboost_model.adaboost_process(df, target_taxon, model_name, training_history,
                                                   validation_file)
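A minimal dispatch sketch. The dataframe would normally come from the restriction step in taxonomic_level_modelling, and the 'rf' abbreviation and '.joblib' suffix in the file names are assumed values, not the module's actual entries:

```python
import pandas as pd

# Toy restricted dataframe (hypothetical; real data carries the full metadata features).
restricted_df = pd.DataFrame({
    'taxon_species_name': ['Panthera leo', 'Panthera pardus'],
    # ... plus the metadata feature columns the pipeline expects
})

model_selection_execution(model='Random forest',
                          df=restricted_df,
                          target_taxon='taxon_species_name',
                          model_name='panthera_rf_model.joblib',
                          training_history='panthera_rf_training_accuracy.csv',
                          validation_file='panthera_rf_validation.csv')
```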

model_simplification(df, model, target_taxon, model_save_type, file_name_start)

This method simplifies the model training, testing, and saving process. It completes the full training, testing, and saving cycle for the specified model and dataset.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| df | DataFrame | The combined observation and metadata dataframe with the taxonomic parent node restriction applied. Only taxonomic child labels are present in df. | required |
| model | str | Specification of the model to be used to classify the taxonomic child nodes. | required |
| target_taxon | str | Specification of the taxonomic level of the taxon child nodes (not the taxonomic level of the parent node). | required |
| model_save_type | str | Specification of the model file type/suffix. | required |
| file_name_start | str | The standardized taxon parent node label used to construct a unique file name for the trained model. | required |
Source code in src/models/meta/model_training.py
def model_simplification(df: pd.DataFrame,
                         model: str,
                         target_taxon: str,
                         model_save_type: str,
                         file_name_start: str):
    """This method simplifies the model training, testing, and saving process. It completes a full model training,
    testing, and saving for the specified model and dataset

    Args:
        df (DataFrame): The combined observation and metadata dataframe with the taxonomic parent node restriction applied. Only taxonomic child labels are present in df.
        model (str): Specification of the model to be used to classify the taxonomic child nodes.
        target_taxon (str): Specification of the taxonomic level of the taxon child nodes (not the taxonomic level of the parent node.
        model_save_type (str): Specification of the model file type/ model suffix.
        file_name_start (str): The standardized taxon parent node label which will be used to construct a unique file name for the trained model.
    """

    model_abbr = model_abbreviations[model]  # Get model abbreviation

    model_name = file_name_start + "_" + model_abbr + "_model" + model_save_type  # Put together essential file names
    training_history = file_name_start + "_" + model_abbr + '_training_accuracy.csv'
    validation_file = file_name_start + "_" + model_abbr + '_validation.csv'

    model_selection_execution(model, df, target_taxon, model_name, training_history, validation_file)  # Select model, and execute the required process (data pipeline, model training, and evaluation)
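For illustration, the file names constructed under assumed values (the 'rf' abbreviation and '.joblib' save type are hypothetical, not the module's actual entries):

```python
# Hypothetical values; the real ones come from model_abbreviations and model_save_types.
file_name_start, model_abbr, model_save_type = 'panthera', 'rf', '.joblib'

model_name = file_name_start + "_" + model_abbr + "_model" + model_save_type
training_history = file_name_start + "_" + model_abbr + '_training_accuracy.csv'
validation_file = file_name_start + "_" + model_abbr + '_validation.csv'
# -> 'panthera_rf_model.joblib', 'panthera_rf_training_accuracy.csv',
#    'panthera_rf_validation.csv'
```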

taxonomic_analysis(df)

This method performs the taxonomic breakdown of the dataset at the following taxonomic levels: taxon_family_name, taxon_genus_name, taxon_species_name, sub_species.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| df | DataFrame | The dataframe containing the unrestricted observations and metadata, used to perform a taxonomic breakdown of the entire dataset. | required |

Returns:

| Type | Description |
| --- | --- |
| dict | Keys specify the taxonomic level; the values are lists containing all unique labels in the dataset, forming a taxonomic breakdown. |

Source code in src/models/meta/model_training.py
def taxonomic_analysis(df: pd.DataFrame):
    """This method performs the taxonomic breakdown of the dataset at the following taxonomic levels: taxon_family_name, taxon_genus_name, taxon_species_name, subspecies

    Args:
        df (DataFrame): The dataframe containing the unrestricted observations and metadata to perform a taxonomic breakdown of the entire dataset

    Returns:
        (dict): Keys specify the taxonomic level and the values are a list containing all unique labels in the dataset, forming a taxonomic breakdown
    """
    taxonomy_list = ['taxon_family_name', 'taxon_genus_name', 'taxon_species_name', 'sub_species']   # Taxonomic levels to target in breakdown
    taxon_breakdown = dict()  # Create an empty dictionary

    for taxon in taxonomy_list:  # Iterate through the taxonomic levels
        df = df.dropna(subset=[taxon])  # Remove all n/a labels
        taxon_breakdown[taxon] = df[taxon].unique().tolist()  # Find unique taxon level labels. These form the values to the taxonomic key in the dictionary
    return taxon_breakdown  # Return the taxonomic breakdown dictionary
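A toy illustration of the breakdown (hypothetical data; assumes taxonomic_analysis is imported from this module). Note that the dropna calls are cumulative, so rows lacking a sub_species label are excluded from that level:

```python
import pandas as pd

# Two hypothetical felid observations with no subspecies labels.
toy = pd.DataFrame({
    'taxon_family_name': ['Felidae', 'Felidae'],
    'taxon_genus_name': ['Panthera', 'Acinonyx'],
    'taxon_species_name': ['Panthera leo', 'Acinonyx jubatus'],
    'sub_species': [None, None],
})
print(taxonomic_analysis(toy))
# {'taxon_family_name': ['Felidae'],
#  'taxon_genus_name': ['Panthera', 'Acinonyx'],
#  'taxon_species_name': ['Panthera leo', 'Acinonyx jubatus'],
#  'sub_species': []}
```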

taxonomic_level_modelling(observation_file, metadata_file, model)

This method performs a taxonomic level breakdown and training at all taxonomic levels for the specified model.

This method performs the dataset taxonomic restriction at the parent node, modifying the dataset for each taxonomic parent node so that only the taxonomic children of that parent node remain. This is done for the entire taxonomic structure within the dataset.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| observation_file | str | The processed iNaturalist observations dataset. | required |
| metadata_file | str | The corresponding metadata for the observation file. | required |
| model | str | String specification of the model to be trained and evaluated. | required |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| models | list | A list of file names, where each file name specifies the taxonomic parent node (the model classifies its taxonomic children). |
| taxon_targets | list | The list of taxonomic target levels, in the same order as the models list. |

Source code in src/models/meta/model_training.py
def taxonomic_level_modelling(observation_file: str, metadata_file: str, model: str):
    """This method performs a taxonomic level breakdown and training at all taxonomic levels for the specified model.

    This method performs the dataset taxonomic restriction at the parent node, modifying the dataset to fit each taxonomic parent node,
    such that only the taxonomic children of the parent node are within the dataset. This is done for the entire taxonomic structure within the dataset.

    Args:
        observation_file (str): The processed iNaturalist observations dataset.
        metadata_file (str): The corresponding metadata for the observation file.
        model (str): String specification of the model to be trained and evaluated.

    Returns:
        models (list): A list of file names, where each file name specifies the taxonomic parent node (the model classifies its taxonomic children)
        taxon_targets (list): Specifies the list of the taxonomic target levels in the same order as the models list.
    """
    # Collection of taxon level information for notebook use
    models = []  # Collection of model file names
    taxon_targets = []  # Collection of taxonomic target levels.

    df_prime = pipelines.aggregate_data(observation_file, metadata_file)  # Aggregate observations with metadata

    df_prime = df_prime[df_prime['taxon_species_name'] != 'Felis catus']  # Remove common household cat from dataset

    df_prime = df_prime.apply(lambda x: sub_species_detection(x), axis=1) # Extract subspecies labels

    taxon_breakdown = taxonomic_analysis(df_prime.copy())  # Generate a taxonomic breakdown (dictionary with taxon level as the keys, and a list of taxonomic labels as the values)
    taxonomic_keys = list(taxon_breakdown.keys())  # Extract taxonomic levels (keys)

    print('<----------->')
    print('Taxon breakdown: ', taxon_breakdown)

    for i in range(len(taxonomic_keys) - 1):  # Iterate through taxonomic keys until species (taxonomic parent node level)
        if len(taxon_breakdown[taxonomic_keys[i + 1]]) > 1:  # Ensure there are more than a single child in the taxonomic level below
            taxon_parent_level = taxonomic_keys[i]

            # Restriction
            for restriction in taxon_breakdown[taxonomic_keys[i]]:  # Restrict the dataset to one of the labels at the parent node
                df = df_prime.copy()

                df = df[df[taxonomic_keys[i]] == restriction]  # Enforce taxon restriction

                target_taxon = taxonomic_keys[i + 1]  # Extract the target taxon (taxonomic child level)
                df = df.dropna(subset=[target_taxon])  # Remove any NaN labels at this taxonomic level

                # Taxonomic level clean-up to determine number of classes (with restriction)
                df = df.dropna(subset=['public_positional_accuracy'])  # Remove n/a entries
                df = df[df['public_positional_accuracy'] <= 40000]  # Remove entries with inadequate accuracy
                df = df[df.groupby(target_taxon).common_name.transform('count') >= 5].copy()  # Enforce at least 5 observations per class

                if df[target_taxon].nunique() <= 1:  # Check at least two classes present with restriction
                    continue

                file_start = generate_file_name_start(restriction)  # Generate file_start_name based on the parent taxon level and the restriction (parent node)

                # Print Information
                print('------------------------------')
                print("Taxonomic Parent level: ", taxon_parent_level)
                print("Restriction: ", restriction)
                print('Target taxon: ', target_taxon)
                print('Model: ', model)

                # Execute the model training with the restricted and processed data
                model_simplification(df=df,
                                     model=model,
                                     target_taxon=target_taxon,
                                     model_save_type=model_save_types[model],
                                     file_name_start=file_start)
                models.append(file_start)  # Save the model file name for notebook input
                taxon_targets.append(target_taxon)
    return models, taxon_targets
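A minimal sketch (hypothetical data) of the restriction step: the dataset is filtered to a single parent-node label, and the next taxonomic level becomes the classification target:

```python
import pandas as pd

# Hypothetical observations; real data comes from pipelines.aggregate_data.
df_prime = pd.DataFrame({
    'taxon_genus_name': ['Panthera', 'Panthera', 'Acinonyx'],
    'taxon_species_name': ['Panthera leo', 'Panthera pardus', 'Acinonyx jubatus'],
})

restriction, target_taxon = 'Panthera', 'taxon_species_name'
df = df_prime[df_prime['taxon_genus_name'] == restriction].dropna(subset=[target_taxon])
print(df[target_taxon].nunique())  # 2 -> more than one class, so a model would be trained
```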

train_base_model(model, target_taxon, file_name='base_meta')

This method trains the root node of the taxonomic tree.

The current model training requires the Felid and Elephant datasets to be kept separate to train all of their relevant taxonomic models. This, however, excludes the root classifier that distinguishes between the two taxonomic families. This method ensures the taxonomic root is trained. Note, this method can also be used to train a global metadata classifier by setting the target taxon to the species level.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| model | str | Specification of the model to be used to classify the taxonomic child nodes. | required |
| target_taxon | str | Specification of the taxonomic level of the taxon child nodes (not the taxonomic level of the parent node). | required |
| file_name | str | The file name of the root classification model. | 'base_meta' |
Source code in src/models/meta/model_training.py
def train_base_model(model: str, target_taxon: str, file_name: str = 'base_meta'):
    """This method trains the root node of the taxonomic tree.

    The current model training requires the Felid and Elephant datasets to be kept separate to train all of their relevant taxonomic models.
    This, however, excludes the root classifier that distinguishes between the two taxonomic families.
    This method ensures the taxonomic root is trained.
    Note, this method can also be used to train a global metadata classifier by setting the target taxon to the species level.

    Args:
        model (str): Specification of the model to be used to classify the taxonomic child nodes.
        target_taxon (str): Specification of the taxonomic level of the taxon child nodes (not the taxonomic level of the parent node)
        file_name (str): The file name of the root classification model.
    """
    df_felids = pipelines.aggregate_data('felids_train.csv', 'felids_meta.csv')  # Aggregate felid observations and metadata
    df_proboscidia = pipelines.aggregate_data('proboscidia_train.csv', 'proboscidia_meta.csv')  # Aggregate elephant observations and metadata
    df = pd.concat([df_felids, df_proboscidia])  # Join both felid and elephant datasets into one

    silhouette_k_means.k_max = 84  # Modify the silhouette score process to increase the range and interval due to the large number of data points.
    silhouette_k_means.k_interval = 20

    model_suffix = model_abbreviations[model]  # Get model suffix
    model_selection_execution(model,
                              df,
                              target_taxon,
                              file_name + '_model' + model_suffix,
                              file_name + '_training_accuracy',
                              file_name + '_validation.csv')  # Train model on one model type at a time
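A hedged usage sketch: training the root classifier with one model type ('Random forest' is one of the labels dispatched in model_selection_execution; 'global_meta' is a hypothetical file name):

```python
# Train the taxonomic root classifier (Felidae vs. the elephant family).
train_base_model(model='Random forest', target_taxon='taxon_family_name')

# Or train a global metadata classifier by targeting the species level.
train_base_model(model='Random forest',
                 target_taxon='taxon_species_name',
                 file_name='global_meta')
```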