
Pipelines

This file establishes the dynamic pipelines used to produce training, test, and validation sets from the dataset.

A dynamic pipeline accepts processed data and transforms it into training, test, and validation data with the specified labels. The pipelines account for the variable taxonomic levels and the encoding of the location feature to produce these transformations.

Note that the encoding of the location feature occurs within the pipeline processes. Please review the Silhouette score documentation for further information on the process.

Attributes:

root_path (str): The path to the project root.
data_path (str): The path to where the data is stored within the project.
save_path (str): The path to where models and validation data (if created) are saved. To train the models used in the ensemble, use /models/meta/. For the metamodel notebook comparison, use /notebooks/meta_modelling/model_comparison_cache/. For the notebook, please ensure the directory exists.
validation_set_flag (bool): A boolean flag indicating whether a validation set should be created and saved. The validation set is saved to save_path. Each file will have the suffix _validation.csv.
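
As a rough illustration of how these attributes fit together, the module-level configuration could look like the sketch below. The specific values are assumptions for illustration, not taken from the source.

root_path = '/home/user/project/'  # Assumed project root; replace with your own
data_path = 'data/processed/'  # Assumed processed-data directory
save_path = 'models/meta/'  # Save location when training the models used in the ensemble
validation_set_flag = True  # Also write a *_validation.csv file for each pipeline run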

aggregate_data(observation_file, meta_file)

This method aggregates the original observations with the collected metadata to form a single cohesive dataframe.

Parameters:

observation_file (str): The file name that points to the file containing the processed iNaturalist observations. Required.
meta_file (str): The file name that points to the file containing the metadata for the processed iNaturalist observations. Required.
Source code in src/models/meta/pipelines.py
def aggregate_data(observation_file: str, meta_file: str) -> pd.DataFrame:
    """This method aggregates the original observations with the collected metadata to form a single cohesive dataframe

    Args:
        observation_file (str): The file name, that points to the file containing the processed iNaturalist observations
        meta_file (str): The file name, that points to the file containing the metadata for the processed iNaturalist observations
    """
    obs_df = pd.read_csv(root_path + data_path + observation_file, index_col=0)  # Read in the csv files
    meta_df = pd.read_csv(root_path + data_path + meta_file, index_col=0)

    obs_df = obs_df.drop(columns=['observed_on', 'local_time_observed_at', 'positional_accuracy'])  # Drop repeated/ non-essential columns
    meta_df = meta_df.drop(columns=['lat', 'long', 'time'])

    df = pd.merge(obs_df, meta_df, how='inner', left_index=True, right_index=True)  # Merge the two dataframes
    return df
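
A minimal usage sketch, assuming the module path attributes are configured; the file names are hypothetical placeholders.

df = aggregate_data('proc_observations.csv', 'proc_meta.csv')  # Hypothetical file names
print(df.shape)  # One row per observation present in both files (inner join on the shared index)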

dark_light_calc(x)

This method creates the dark/light feature based on the time of observation and the sunrise and sunset times.

Parameters:

x (DataFrame row): This variable represents a dataframe row. Required.

Returns:

(DataFrame row): The dataframe row with a new binary 'light' column.

Source code in src/models/meta/pipelines.py
def dark_light_calc(x):
    """This method performs the dark/ light feature creation based on the time of observation and the sunrise & sunset times
    Args:
        x (DataFrame row): This variable represents a dataframe row.

    Returns:
        (DataFrame row): The method returns the dataframe row with a new binary 'light' column
    """
    timezone = pytz.timezone(x['time_zone'])  # Extract the timezone
    sunrise_utc = x['sunrise']
    sunset_utc = x['sunset']

    observ_time = x['observed_on'].replace(tzinfo=pytz.utc)  # Attach UTC timezone info (required to localize in the next step)
    observ_time = observ_time.astimezone(timezone)  # Convert the observation time into local time

    x['light'] = int(sunrise_utc <= observ_time <= sunset_utc)  # Logical operators cast into integer form create the binary light (1) or dark (0) value
    return x
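
A small hedged check of the logic on a hand-built row; all values are invented for illustration.

import pandas as pd
import pytz

row = pd.Series({
    'time_zone': 'Africa/Johannesburg',
    'sunrise': pd.Timestamp('2023-01-10 05:30').replace(tzinfo=pytz.timezone('Africa/Johannesburg')),
    'sunset': pd.Timestamp('2023-01-10 19:00').replace(tzinfo=pytz.timezone('Africa/Johannesburg')),
    'observed_on': pd.Timestamp('2023-01-10 12:00', tz='UTC'),  # 14:00 local time
})
print(dark_light_calc(row)['light'])  # 1 -> the observation falls between sunrise and sunset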

day_night_calculation(df)

This method provides the overall process to create the light/dark feature.

This method converts the time of observation, sunrise, and sunset into local times. Local times are compared to determine light or dark. The sunrise and sunset columns are removed as they are no longer required.

Parameters:

df (DataFrame): The dataframe containing all observation data from the processed data directory. Required.

Returns:

(DataFrame): The dataframe with the additional light column, and the sunrise and sunset columns removed.

Source code in src/models/meta/pipelines.py
def day_night_calculation(df: pd.DataFrame):
    """This method provides the overall process to create the light/ dark feature.

    This method converts the time of observation, sunrise, and sunset into local times.
    Local times are compared to determine light or dark.
    The sunrise and sunset columns are removed as they are no longer required.

    Args:
        df (DataFrame): The dataframe containing all observation data from the processed data directory.

    Returns:
        (DataFrame): The dataframe with the additional light column, and the sunrise & sunset columns removed.
    """
    # Convert to datetime objects. Remove NaT values from resulting transformation
    df['sunrise'] = pd.to_datetime(df['sunrise'],
                                   format="%Y-%m-%dT%H:%M",
                                   errors='coerce')
    df['sunset'] = pd.to_datetime(df['sunset'],
                                  format="%Y-%m-%dT%H:%M",
                                  errors='coerce')
    df = df.dropna(subset=['sunrise', 'sunset'])

    df = df.apply(lambda x: localize_sunrise_sunset(x), axis=1)  # Localize sunrise and sunset times to be timezone aware

    df = df.apply(lambda x: dark_light_calc(x), axis=1)  # Dark/ light calc based on sunrise and sunset times

    df = df.drop(columns=['sunrise', 'sunset'])

    return df

decision_tree_data(df, taxon_target, validation_file)

Method to create the train/test/validation data to be used by the decision tree, random forest, and AdaBoost models.

Parameters:

df (DataFrame): The dataframe containing all data for each observation. Required.
taxon_target (str): The taxonomic target level, used to extract the correct labels (taxon_family_name, taxon_genus_name, taxon_species_name, sub_species). Required.
validation_file (str): The name of the file where the validation data will be stored. Also informs the name of the saved models. Required.

Returns:

X (DataFrame): The features in a form suitable for direct use within the models.
y (Series): The labels for the corresponding observations at the correct taxonomic level.

Source code in src/models/meta/pipelines.py
def decision_tree_data(df: pd.DataFrame, taxon_target: str, validation_file: str):
    """Method to create the train/set/validation data to be used by the decision tree/ random forest/ Adaboost models

    Args:
        df (DataFrame): The dataframe containing all data for each observation.
        taxon_target (str): The taxonomic target level, to extract the correct labels (taxon_family_name, taxon_genus_name, taxon_species_name, subspecies)
        validation_file (str): The name of the file where the validation data will be stored. Also informs the name of the saved models.

    Returns:
        X (DataFrame): The features in a form suitable for direct use within the models.
        y (Series): The labels for the corresponding observations as the correct taxonomic level.
    """
    k_means = silhouette_k_means.silhouette_process(df, validation_file)
    X, y = tree_pipeline(df, k_means, taxon_target, validation_file)
    return X, y
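
A hedged end-to-end sketch of how this entry point might be driven; the file names are hypothetical placeholders.

df = aggregate_data('proc_observations.csv', 'proc_meta.csv')  # Hypothetical file names
X, y = decision_tree_data(df, taxon_target='taxon_species_name', validation_file='species_validation.csv')
# X and y are now ready for e.g. sklearn.tree.DecisionTreeClassifier().fit(X, y)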

elevation_clean(x)

This method performs a logical check on each observation's elevation based on the land feature value.

The Open-Meteo API sets elevation to 0m if the elevation is unknown. If the elevation is 0m and the land value is 1 (indicating a terrestrial sighting), then the elevation is set to a NaN value. This NaN value is later replaced within the pipeline by the species' average elevation.

Parameters:

x (DataFrame row): This variable represents a dataframe row containing the 'land' column. Required.

Returns:

(DataFrame row): The dataframe row with the 'elevation' feature adjusted.

Source code in src/models/meta/pipelines.py
def elevation_clean(x):  # If observation is terrestrial, 0.0m elevation requires modification
    """This method performs a logical check on each observation's elevation based on the land feature value.

    The Open-Meteo API sets elevation to 0m if the elevation is unknown.
    If the elevation is 0m and the land value is 1 (indicating a terrestrial sighting), then the elevation is set to a NaN value.
    This NaN value is later replaced within the pipeline by the species' average elevation.

    Args:
        x (DataFrame row): This variable represents a dataframe row containing the 'land' column.

    Returns:
        (DataFrame row): The method returns the dataframe row with the 'elevation' feature adjusted.
    """
    land = x['land']
    elevation = x['elevation']

    if land == 1 and elevation == 0:
        x['elevation'] = np.nan
    return x

general_pipeline(df, k_means, taxon_target)

Method performs general pipeline functions for all model types (neural network, XGBoost, AdaBoost, decision tree, random forest).

Parameters:

df (DataFrame): The dataframe containing all observation data from the processed data directory. Required.
k_means (KMeans): The trained K-means model that performs the location encoding. Required.
taxon_target (str): The taxonomic level at which to extract the taxon labels (taxon_family_name, taxon_genus_name, taxon_species_name, sub_species). Required.

Returns:

(DataFrame): A dataframe containing cleaned, transformed, and new data features for further model-specific processing.

Source code in src/models/meta/pipelines.py
def general_pipeline(df: pd.DataFrame, k_means: KMeans, taxon_target: str):
    """Method performs general pipeline functions for all model types (Neural network, XGBoost, AdaBoost, Decision tree, Random Forest)

    Args:
        df (DataFrame): The dataframe containing all observation data from the processed data directory.
        k_means (KMeans): The trained K-means model that performs the location encoding
        taxon_target (str): The taxonomic level at which to extract the taxon labels (taxon_family_name, taxon_genus_name, taxon_species_name, sub_species)

    Returns:
        (DataFrame): A dataframe containing cleaned, transformed, and new data features for further specified processing depending on the model.
    """
    # Data Cleaning
    df = df.drop(columns=['geoprivacy', 'taxon_geoprivacy', 'taxon_id', 'license', 'image_url'])  # Remove non-essential columns
    df = df.dropna(subset=['taxon_species_name', 'public_positional_accuracy'])  # Remove null species names and positional accuracies
    df = df[df['public_positional_accuracy'] <= 40000]  # Positional Accuracy Restriction
    df = df.drop(columns=['public_positional_accuracy'])  # Drop the public positional accuracy column

    # Transformations
    df = df.apply(lambda x: sub_species_detection(x), axis=1)  # Generate subspecies labels
    df = df.drop(columns=['scientific_name'])  # Drop the scientific name column

    df = df[df.groupby(taxon_target).common_name.transform('count') >= 5].copy()  # Remove species with less than 5 observations

    df['location_cluster'] = k_means.predict(df[['latitude', 'longitude']])  # Location encoding using K-means

    df['land'] = 1  # All observations from dataset are terrestrial. For unknown datasets use the `land_mask()` method to automate the feature value

    df = df.apply(lambda x: elevation_clean(x), axis=1)  # Clean elevation values. In aquatic observations, the max elevation is sea level 0m
    df['elevation'] = df['elevation'].fillna(df.groupby('taxon_species_name')['elevation'].transform('mean'))  # If elevation is missing, interpolate with mean species elevation

    df['hemisphere'] = (df['latitude'] >= 0).astype(int)  # Northern/ Southern hemisphere feature
    df = df.drop(columns=['latitude', 'longitude']) # Remove longitude and latitude columns

    df['observed_on'] = pd.to_datetime(df['observed_on'], format="%Y-%m-%d %H:%M:%S%z", utc=True)  # Datetime transform into datetime object
    df['month'] = df['observed_on'].dt.month  # Month feature
    df['hour'] = df.apply(lambda x: x['observed_on'].astimezone(pytz.timezone(x['time_zone'])).hour, axis=1)  # Local time zone hour feature
    df = day_night_calculation(df)  # Day/ night feature
    df = df.apply(lambda x: season_calc(x), axis=1)  # Season feature into categorical values
    df = ohe_season(df)  # One-hot-encode the categorical season values

    return df

land_mask(x)

This method determines whether the observation coordinates are terrestrial or aquatic.

This method uses the Globe library to evaluate the location against a land mask. If the observation is terrestrial, a value of 1 is given; if not, 0 is given.

Parameters:

x (DataFrame row): This variable represents a dataframe row containing the 'latitude' and 'longitude' columns. Required.

Returns:

(DataFrame row): The dataframe row with an additional binary column value 'land'.

Source code in src/models/meta/pipelines.py
def land_mask(x):
    """This method determines if the observation coordinates are terrestrial or aquatic in nature.

    This method uses the Globe library to evaluate the location against a land mask.
    If the observation is terrestrial a value of 1 is given. If not 0 is given.

    Args:
        x (DataFrame row): This variable represents a dataframe row containing the 'latitude' and 'longitude' columns.

    Returns:
        (DataFrame row): The method returns the dataframe row with an additional binary column value 'land'
    """
    latitude = x['latitude']
    longitude = x['longitude']
    x['land'] = int(globe.is_land(latitude, longitude))
    return x
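
A quick hedged check, assuming globe is the land mask from the global_land_mask package (which exposes globe.is_land(lat, lon)); the coordinates are chosen purely for illustration.

import pandas as pd

row = pd.Series({'latitude': -25.74, 'longitude': 28.19})  # Pretoria: inland
print(land_mask(row)['land'])  # 1 -> terrestrial

row = pd.Series({'latitude': -35.0, 'longitude': 0.0})  # Open South Atlantic
print(land_mask(row)['land'])  # 0 -> aquatic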

localize_sunrise_sunset(x)

This method localizes the sunrise and sunset times based on the time zone to aid in the light/dark feature.

Parameters:

x (DataFrame row): This variable represents a dataframe row containing the 'sunrise' and 'sunset' columns. Required.

Returns:

(DataFrame row): The dataframe row with the 'sunrise' and 'sunset' features adjusted.

Source code in src/models/meta/pipelines.py
def localize_sunrise_sunset(x):
    """This method localizes the sunrise and sunset times based on the time zone to aid in the light/ dark feature.
    Args:
        x (DataFrame row): This variable represents a dataframe row containing the 'sunrise' and 'sunset' columns.

    Returns:
        (DataFrame row): The method returns the dataframe row with the 'sunrise' and 'sunset' features adjusted.
    """
    timezone = pytz.timezone(x['time_zone'])
    x['sunrise'] = x['sunrise'].replace(tzinfo=timezone)
    x['sunset'] = x['sunset'].replace(tzinfo=timezone)
    return x
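
One caveat worth flagging: attaching a pytz timezone with replace(tzinfo=...) is a documented pytz pitfall, as it applies the zone's first known (often pre-standardisation) UTC offset rather than the correct offset for the date in question. A hedged sketch of an alternative using the pandas Timestamp API is shown below; the function name is illustrative, not part of the source.

def localize_sunrise_sunset_alt(x):
    """Illustrative variant: tz_localize selects the correct UTC offset for the date in question."""
    timezone = pytz.timezone(x['time_zone'])
    x['sunrise'] = x['sunrise'].tz_localize(timezone)  # x['sunrise'] is a naive pandas Timestamp at this point
    x['sunset'] = x['sunset'].tz_localize(timezone)
    return x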

neural_network_data(df, taxon_target, validation_file)

Method to create the train/test/validation data to be used by the neural network model.

Parameters:

df (DataFrame): The dataframe containing all data for each observation. Required.
taxon_target (str): The taxonomic target level, used to extract the correct labels (taxon_family_name, taxon_genus_name, taxon_species_name, sub_species). Required.
validation_file (str): The name of the file where the validation data will be stored. Also informs the name of the saved models. Required.

Returns:

X (DataFrame): The features in a form suitable for direct use within the models.
y (ndarray): The OHE labels for the corresponding observations at the correct taxonomic level.
classes (int): The number of classes in the data labels.

Source code in src/models/meta/pipelines.py
def neural_network_data(df: pd.DataFrame, taxon_target: str, validation_file: str):
    """Method to create the train/set/validation data to be used by the neural network model.

    Args:
        df (DataFrame): The dataframe containing all data for each observation.
        taxon_target (str): The taxonomic target level, to extract the correct labels (taxon_family_name, taxon_genus_name, taxon_species_name, subspecies)
        validation_file (str): The name of the file where the validation data will be stored. Also informs the name of the saved models.

    Returns:
        X (DataFrame): The features in a form suitable for direct use within the models.
        y (Series): The labels for the corresponding observations as the correct taxonomic level.
        classes (int): The number of classes in the data labels
    """
    k_means = silhouette_k_means.silhouette_process(df, validation_file)
    X, y, classes = nn_pipeline(df, k_means, taxon_target, validation_file)
    return X, y, classes

nn_binary_label_handling(y)

Method handles the OHE of a binary case to ensure that the OHE values returned are of the form [1, 0] or [0, 1].

Parameters:

y (ndarray): The labels in the form of either 1 or 0 to be transformed into a binary OHE. Required.

Returns:

(ndarray): An array containing OHE labels of the form [1, 0] or [0, 1].

Source code in src/models/meta/pipelines.py
def nn_binary_label_handling(y):
    """Method handles the OHE of a binary case to ensure that OHE values returned are of the form [1, 0] or [0, 1].

    Args:
        y (Series): The labels in the form of either 1 or 0 to be transformed into a binary OHE

    Returns:
        (Series): Returns a Series containing OHE labels of the form [1, 0] or [0, 1]
    """
    return np.hstack((1 - y.reshape(-1, 1), y.reshape(-1, 1)))
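
A quick worked example on a toy label column, as produced by LabelBinarizer for a two-class problem:

import numpy as np

y = np.array([0, 1, 1])
print(nn_binary_label_handling(y))
# [[1 0]
#  [0 1]
#  [0 1]]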

nn_pipeline(df, k_means, taxon_target, validation_file)

This method performs further data processing to structure and format the data for use in the neural network model.

This method performs similar processing steps to both the decision tree and XGBoost pipelines. However, categorical variables are required to be OHE and the resulting features are normalized for use in the model.

Parameters:

df (DataFrame): The dataframe containing all observation data from the processed data directory. Required.
k_means (KMeans): The trained K-means model that performs the location encoding. Required.
taxon_target (str): The taxonomic level at which to extract the taxon labels (taxon_family_name, taxon_genus_name, taxon_species_name, sub_species). Required.
validation_file (str): The name of the file to store validation data. Informs model naming as well. Required.

Returns:

X (DataFrame): A dataframe containing observations in rows and features in columns, ready for use as input to the models for training and evaluation. These features are normalized.
y (ndarray): The OHE encoding of the observation labels at the specified taxonomic level.
classes (int): The number of classes in the data labels.

Source code in src/models/meta/pipelines.py
def nn_pipeline(df, k_means, taxon_target, validation_file: str):
    """This method performs further data processing to structure and format it for use in the Neural Network model

    This method performs similar processing steps to both the decision tree and XGBoost pipelines.
    However, categorical variables are required to be OHE and the resulting features are normalized for use in the model.

    Args:
        df (DataFrame): The dataframe containing all observation data from the processed data directory.
        k_means (KMeans): The trained K-means model that performs the location encoding
        taxon_target (str): The taxonomic level at which to extract the taxon labels (taxon_family_name, taxon_genus_name, taxon_species_name, sub_species)
        validation_file (str): The name of file to store validation data. Informs model naming as well.

    Returns:
        X (DataFrame): A dataframe containing observations in rows and features in columns, ready for use as input features to the models for training and evaluation. These features are normalized.
        y (ndarray): The OHE encoding of the observation labels at the specified taxonomic level.
        classes (int): The number of classes in the data labels.
    """
    df = general_pipeline(df, k_means, taxon_target)

    # Generate dummy variables for categorical features
    df = pd.get_dummies(df, prefix='loc', columns=['location_cluster'], drop_first=True)  # OHE location cluster feature
    df = pd.get_dummies(df, prefix='hr', columns=['hour'], drop_first=True)  # OHE hour feature
    df = pd.get_dummies(df, prefix='mnth', columns=['month'], drop_first=True)  # OHE month feature

    df = df.drop(columns=['observed_on', 'time_zone'])  # Drop observed on column as date & time transformations are complete
    df = validation_set(df, taxon_target, validation_file)  # Create validation set for further testing

    # Data formatting
    taxon_y = df[taxon_target]  # Retrieve labels at taxonomic target level

    if taxon_y.isnull().any():  # If no taxonomic label is present, remove the observation
        df = df.dropna(subset=[taxon_target])

    y = df[taxon_target]  # Extract labels
    X = df.drop(columns=['taxon_kingdom_name', 'taxon_phylum_name',
                         'taxon_class_name', 'taxon_order_name', 'taxon_family_name',
                         'taxon_genus_name', 'taxon_species_name', 'sub_species', 'common_name'])  # Extract features only

    X, y = over_sample(X, y)  # Resample dataset to reduce imbalance

    y, classes = ohe_labels(y)  # OHE labels

    # Standardize the continuous features using z-score standardization (StandardScaler)
    norm_columns = ['apparent_temperature', 'apparent_temperature_max', 'apparent_temperature_min',
                    'cloudcover', 'cloudcover_high', 'cloudcover_low', 'cloudcover_mid', 'dewpoint_2m',
                    'diffuse_radiation', 'direct_radiation', 'elevation', 'et0_fao_evapotranspiration_daily',
                    'et0_fao_evapotranspiration_hourly', 'precipitation', 'precipitation_hours',
                    'precipitation_sum', 'rain', 'rain_sum', 'relativehumidity_2m', 'shortwave_radiation',
                    'shortwave_radiation_sum', 'snowfall', 'snowfall_sum', 'soil_moisture_0_to_7cm',
                    'soil_moisture_28_to_100cm', 'soil_moisture_7_to_28cm', 'soil_temperature_0_to_7cm',
                    'soil_temperature_28_to_100cm', 'soil_temperature_7_to_28cm', 'surface_pressure',
                    'temperature_2m', 'temperature_2m_max', 'temperature_2m_min', 'vapor_pressure_deficit',
                    'winddirection_100m', 'winddirection_10m', 'winddirection_10m_dominant',
                    'windgusts_10m', 'windgusts_10m_max', 'windspeed_100m', 'windspeed_10m',
                    'windspeed_10m_max']
    X[norm_columns] = StandardScaler().fit_transform(X[norm_columns])
    return X, y, classes

north_south_calc(month, seasons)

This method determines the current season when provided with a month and a list of seasonal months.

Note, this method is used within the season_calc() method.

Parameters:

month (int): The integer value of the month of sighting [1-12]. Required.
seasons (list): The list of months separated by season, starting with winter. Example for the northern hemisphere: [[12, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11]]. Required.

Returns:

(str): The season categorical variable.

Source code in src/models/meta/pipelines.py
def north_south_calc(month: int, seasons: list):
    """This method determined the current season when provided with a month and a list of seasonal months.

     Note, this method is used within the `season_calc()` method.

     Args:
         month (int): The integer value of the month of sighting [1-12]
         seasons (list): The list of months seperated by season, starting with winter. Example of northern hemisphere [[12, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11]]

    Returns:
        (str): The season categorical variable
     """
    seasons_dict = {0: 'Winter', 1: 'Spring', 2: 'Summer', 3: 'Autumn'}
    season_id = 0
    for i in range(len(seasons)):
        if month in seasons[i]:
            season_id = i
    return seasons_dict[season_id]
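
A short worked example using the northern-hemisphere month grouping from the docstring:

north = [[12, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11]]  # Winter, spring, summer, autumn
print(north_south_calc(4, north))   # 'Spring' -> April sits in the second group (index 1)
print(north_south_calc(12, north))  # 'Winter'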

ohe_labels(y)

This method encodes the taxonomic labels in a One-hot-encoded (OHE) format.

Special consideration is enforced for binary labels such that the resulting OHE labels are of the form [0, 1] or [1, 0].

Parameters:

y (Series): The categorical taxonomic labels. Required.

Returns:

(ndarray): The OHE taxonomic labels.
(int): The number of unique classes.

Source code in src/models/meta/pipelines.py
def ohe_labels(y):
    """This method encodes the taxonomic labels in a One-hot-encoded format.

    Special consideration is enforced for binary labels such that the resulting ohe labels are of the form [0, 1] or [1, 0]

    Args:
        y (Series): The categorical taxonomic labels

    Returns:
        (Series): OHE taxonomic labels
    """
    classes = y.nunique()  # OHE encode the labels
    lb = LabelBinarizer()
    lb.fit(y)
    y = lb.transform(y)

    if classes == 2:  # Modification required if only two classes are present
        y = nn_binary_label_handling(y)
    return y, classes
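
A small worked example, including the binary special case:

import pandas as pd

y = pd.Series(['cat', 'dog', 'cat'])
y_ohe, classes = ohe_labels(y)
print(classes)  # 2
print(y_ohe)    # [[1 0], [0 1], [1 0]] -> the binary case is routed through nn_binary_label_handling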

ohe_month(df)

Method performs OHE on the month feature of each observation.

Parameters:

df (DataFrame): The dataframe containing all observation data from the processed data directory. Required.

Returns:

(DataFrame): The dataframe with the month feature OHE.
Source code in src/models/meta/pipelines.py
def ohe_month(df: pd.DataFrame):
    """Method performs OHE on the month feature of each observation

    Args:
        df (DataFrame): The dataframe containing all observation data from the processed data directory.

    Returns:
        (DataFrame): The dataframe with the month feature OHE.
    """
    cats = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]  # Initialise the possible month categories
    cat_type = CategoricalDtype(categories=cats)

    df['month'] = df['month'].astype(cat_type)  # Cast so all twelve months are known categories, even if some are absent from the data

    df = pd.get_dummies(df, prefix='mnth', columns=['month'], drop_first=True)  # Perform OHE
    return df

ohe_season(df)

This method one-hot encodes (OHE) the season categorical feature.

Parameters:

df (DataFrame): The dataframe containing all observation data from the processed data directory. Required.

Returns:

(DataFrame): The dataframe with the season feature OHE (this results in additional columns within the dataframe).

Source code in src/models/meta/pipelines.py
def ohe_season(df):
    """This method OHE the season categorical feature

    Args:
        df (DataFrame): The dataframe containing all observation data from the processed data directory.

    Returns:
        (DataFrame): The dataframe with the season feature OHE (this results in additional columns within the dataframe)
    """
    cats = ['Winter', 'Spring', 'Summer', 'Autumn']  # Season categorical variables to expect
    cat_type = CategoricalDtype(categories=cats)

    df['season'] = df['season'].astype(cat_type)

    df = pd.get_dummies(df,
                        prefix='szn',
                        columns=['season'],
                        drop_first=True)
    return df
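
A quick illustration of why the fixed CategoricalDtype matters: every season column is emitted even when a season is absent from the data, and 'Winter' (the first category) is dropped as the reference level.

import pandas as pd

df = pd.DataFrame({'season': ['Summer', 'Winter']})
print(ohe_season(df).columns.tolist())
# ['szn_Spring', 'szn_Summer', 'szn_Autumn'] -> consistent columns regardless of which seasons appear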

over_sample(X, y)

This method performs oversampling on the dataset in order to provide a more balanced class distribution, combating the long-tailed distribution characteristic of wildlife data.

Note, the oversampling aims to increase the quantity of observations in minority classes to achieve a more even distribution.

Parameters:

X (DataFrame): The dataset's observation features to be used in model training and evaluation. Required.
y (Series): The label for each observation (still categorical). Required.

Returns:

X_res (DataFrame): The features dataset with additional observations due to the oversampling.
y_res (Series): The associated observation labels, including those for the additional observations created.

Source code in src/models/meta/pipelines.py
def over_sample(X, y):
    """This method performs oversampling on the dataset in order to provide a more balanced data distribution, to combat the tail-end distribution (characteristic of wildlife data).

    Note, the oversampling aims to increase the quantity of observations in minority classes to achieve a more even distribution.

    Args:
        X (DataFrame): The dataset's observation features to be used in model training and evaluation.
        y (Series): The label for each observation (still categorical)

    Returns:
         X_res (DataFrame): The features dataset with additional observations due to the oversampling
         y_res (Series): An associated dataframe containing the observation labels, including for the additional observations created.
    """
    ros = RandomOverSampler(sampling_strategy='minority', random_state=2)  # 'minority' resamples only the smallest class, duplicating its rows until it matches the majority class size
    X_res, y_res = ros.fit_resample(X, y)
    return X_res, y_res
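
A toy demonstration of the resampling behaviour:

import pandas as pd

X = pd.DataFrame({'f': range(6)})
y = pd.Series(['a', 'a', 'a', 'a', 'b', 'b'])
X_res, y_res = over_sample(X, y)
print(y_res.value_counts())  # a: 4, b: 4 -> minority class 'b' rows are duplicated up to the majority count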

season_calc(x)

This method determines the season in which an observation occurred based on the month of observation.

Parameters:

x (DataFrame row): This variable represents a dataframe row. Required.

Returns:

(DataFrame row): The dataframe row with a new season feature.

Source code in src/models/meta/pipelines.py
def season_calc(x):
    """This method determines the season in which an observation occurred based on the month of observation

    Args:
        x (DataFrame row): This variable represents a dataframe row.

    Returns:
        (DataFrame row): The method returns the dataframe row with a new season feature
    """
    hemisphere = x['hemisphere']
    month = x['month']
    season = 0

    if hemisphere == 1:  # Northern hemisphere
        winter, spring, summer, autumn = [12, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11]  # Order of months starting with winter season
        seasons = [winter, spring, summer, autumn]
        season = north_south_calc(month, seasons)
    else:  # Southern hemisphere
        winter, spring, summer, autumn = [6, 7, 8], [9, 10, 11], [12, 1, 2], [3, 4, 5]  # Order of months starting with winter seasons
        seasons = [winter, spring, summer, autumn]
        season = north_south_calc(month, seasons)

    x['season'] = season
    return x

sub_species_detection(x)

Method uses the scientific name of observations to extract the subspecies name when there are three or more words present (a third name denotes a subspecies).

Parameters:

x (DataFrame row): This variable represents a dataframe row containing the 'scientific_name' column. Required.

Returns:

(DataFrame row): The dataframe row with an additional column value 'sub_species' if it could be extracted from the scientific name.

Source code in src/models/meta/pipelines.py
def sub_species_detection(x):
    """Method uses the scientific name of observations to extract the subspecies name when there are more than
    three words present (3 names describe a subspecies)

    Args:
        x (DataFrame row): This variable represents a dataframe row containing the 'scientific_name' column.

    Returns:
        (DataFrame row): The method returns the dataframe row with an additional column value 'sub_species' if it could be extracted from the scientific name.
    """
    name_count = len(x['scientific_name'].split())  # Determine the number of names in the scientific name
    x['sub_species'] = np.nan  # Initialize the subspecies value
    if name_count >= 3:
        x['sub_species'] = x['scientific_name']
    return x
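
A short worked example; the species names are chosen purely for illustration.

import pandas as pd

row = pd.Series({'scientific_name': 'Panthera leo melanochaita'})
print(sub_species_detection(row)['sub_species'])  # 'Panthera leo melanochaita' -> three names, so a subspecies

row = pd.Series({'scientific_name': 'Panthera leo'})
print(sub_species_detection(row)['sub_species'])  # nan -> only genus and species present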

tree_pipeline(df, k_means, taxon_target, validation_file)

This method performs further data processing to structure and format the data for use in the decision tree, random forest, and AdaBoost models.

Parameters:

df (DataFrame): The dataframe containing all observation data from the processed data directory. Required.
k_means (KMeans): The trained K-means model that performs the location encoding. Required.
taxon_target (str): The taxonomic level at which to extract the taxon labels (taxon_family_name, taxon_genus_name, taxon_species_name, sub_species). Required.
validation_file (str): The name of the file to store validation data. Informs model naming as well. Required.

Returns:

X (DataFrame): A dataframe containing observations in rows and features in columns, ready for use as input to the models for training and evaluation.
y (Series): The categorical labels of the associated observations at the specified taxonomic level.

Source code in src/models/meta/pipelines.py
def tree_pipeline(df, k_means, taxon_target, validation_file: str):
    """This method performs further data processing to structure and format it for use in a decision tree, random forest and adaboost models.

    Args:
        df (DataFrame): The dataframe containing all observation data from the processed data directory.
        k_means (KMeans): The trained K-means model that performs the location encoding
        taxon_target (str): The taxonomic level at which to extract the taxon labels (taxon_family_name, taxon_genus_name, taxon_species_name, sub_species)
        validation_file (str): The name of file to store validation data. Informs model naming as well.

    Returns:
        X (DataFrame): A dataframe containing observations in rows and features in columns, ready for use as input features to the models for training and evaluation.
        y (Series): The categorical labels of the associated observations at the correct taxonomic level specified.
    """
    df = general_pipeline(df, k_means, taxon_target)  # Perform general pipeline

    df = df.drop(columns=['observed_on', 'time_zone'])  # Drop observed on column as date & time transformations are complete

    df = validation_set(df, taxon_target, validation_file)  # Create validation set for further testing

    # Data formatting
    taxon_y = df[taxon_target]  # Retrieve labels at taxonomic target level

    if taxon_y.isnull().any():  # If no taxonomic label is present, remove the observation
        df = df.dropna(subset=[taxon_target])

    y = df[taxon_target]  # Extract labels
    X = df.drop(columns=['taxon_kingdom_name', 'taxon_phylum_name',
                         'taxon_class_name', 'taxon_order_name', 'taxon_family_name',
                         'taxon_genus_name', 'taxon_species_name', 'sub_species', 'common_name'])  # Extract features only

    X, y = over_sample(X, y)  # Resample dataset to reduce imbalance
    return X, y

validation_set(df, taxon_target, file_name)

This method creates a validation set from the provided dataframe for further model evaluation.

The validation set comprises 20% of each class's composition from the dataframe. The observations included in the validation set are removed from the dataframe.

Parameters:

df (DataFrame): The dataframe containing all observation data from the processed data directory. Required.
taxon_target (str): The taxonomic level at which to extract the taxon labels (taxon_family_name, taxon_genus_name, taxon_species_name, sub_species). Required.
file_name (str): The name of the file in which the validation data will be stored. Required.

Returns:

(DataFrame): The dataframe with the validation observations removed.

Source code in src/models/meta/pipelines.py
def validation_set(df: pd.DataFrame, taxon_target: str, file_name: str):
    """This method creates a validation set from the provided dataframe for further model evaluation

    The validation set comprises 20% of each class's composition from the dataframe.
    The observations included in the validation set are removed from the dataframe.

    Args:
        df (DataFrame): The dataframe containing all observation data from the processed data directory.
        taxon_target (str): The taxonomic level at which to extract the taxon labels (taxon_family_name, taxon_genus_name, taxon_species_name, sub_species)
        file_name (str): The name of the file in which the validation data will be stored.

    Returns:
        (DataFrame): The dataframe with the validation observations removed
    """
    if validation_set_flag:
        grouped = df.groupby([taxon_target]).sample(frac=0.2, random_state=2)  # 20% of each class goes to the validation set
        grouped.to_csv(root_path + save_path + file_name)  # Save evaluation dataset

        df = df.drop(grouped.index)  # Remove validation observations from the current df through their index
    return df

xgb_data(df, taxon_target, validation_file)

Method to create the train/test/validation data to be used by the XGBoost model.

Parameters:

df (DataFrame): The dataframe containing all data for each observation. Required.
taxon_target (str): The taxonomic target level, used to extract the correct labels (taxon_family_name, taxon_genus_name, taxon_species_name, sub_species). Required.
validation_file (str): The name of the file where the validation data will be stored. Also informs the name of the saved models. Required.

Returns:

X (DataFrame): The features in a form suitable for direct use within the models.
y (ndarray): The OHE labels for the corresponding observations at the correct taxonomic level.

Source code in src/models/meta/pipelines.py
def xgb_data(df: pd.DataFrame, taxon_target: str, validation_file: str):
    """Method to create the train/set/validation data to be used by the XGBoost model.

    Args:
        df (DataFrame): The dataframe containing all data for each observation.
        taxon_target (str): The taxonomic target level, to extract the correct labels (taxon_family_name, taxon_genus_name, taxon_species_name, subspecies)
        validation_file (str): The name of the file where the validation data will be stored. Also informs the name of the saved models.

    Returns:
        X (DataFrame): The features in a form suitable for direct use within the models.
        y (Series): The labels for the corresponding observations as the correct taxonomic level.
    """
    k_means = silhouette_k_means.silhouette_process(df, validation_file)
    X, y = xgb_pipeline(df, k_means, taxon_target, validation_file)
    return X, y

xgb_pipeline(df, k_means, taxon_target, validation_file)

This method performs further data processing to structure and format the data for use in the XGBoost model.

This method makes use of the decision tree pipeline (tree_pipeline), simply encoding the labels in a One-Hot-Encoded (OHE) format.

Parameters:

df (DataFrame): The dataframe containing all observation data from the processed data directory. Required.
k_means (KMeans): The trained K-means model that performs the location encoding. Required.
taxon_target (str): The taxonomic level at which to extract the taxon labels (taxon_family_name, taxon_genus_name, taxon_species_name, sub_species). Required.
validation_file (str): The name of the file to store validation data. Informs model naming as well. Required.

Returns:

X (DataFrame): A dataframe containing observations in rows and features in columns, ready for use as input to the models for training and evaluation.
y (ndarray): The OHE encoding of the observation labels at the specified taxonomic level.

Source code in src/models/meta/pipelines.py
def xgb_pipeline(df, k_means, taxon_target, validation_file: str):
    """This method performs further data processing to structure and format it for use in the XGBoost model

    This method makes use of the decision tree pipeline (`tree_pipeline`), simply encoding the labels in a One-Hot-Encoded (OHE) format

    Args:
        df (DataFrame): The dataframe containing all observation data from the processed data directory.
        k_means (KMeans): The trained K-means model that performs the location encoding
        taxon_target (str): The taxonomic level at which to extract the taxon labels (taxon_family_name, taxon_genus_name, taxon_species_name, sub_species)
        validation_file (str): The name of file to store validation data. Informs model naming as well.

    Returns:
        X (DataFrame): A dataframe containing observations in rows and features in columns, ready for use as input features to the models for training and evaluation.
        y (Series): The OHE encoding of the observation labels at the correct taxonomic level specified.
    """
    X, y = tree_pipeline(df, k_means, taxon_target, validation_file)

    y, classes = ohe_labels(y)  # OHE labels

    return X, y