Skip to content

Taxonomic Directory Structure Documentation

This file creates a taxonomic directory structure, allocating the cropped images into the correct taxonomic level and directory.

The taxonomic directory is required for the image classification model training in the Wildlife Classification repository. It makes use of the image_dataset_from_directory() function to construct images with the corresponding labels based on the directory structure.

Attributes:

Name Type Description
root_path str

The absolute path to the root of the project directory.

cropped_img_path str

The path to the sub-images of the observations

img_path str

The path to the training directory where the images are arranged within the taxonomic directory.

test_path str

The path to the validation directory where the images are arranged within the taxonomic directory.

data_path str

The path to where the iNaturalist observations are stored.

multiple_detections_id list

The possible suffixes of the sub-images used to identify how many sub-images exist per observation.

img_size int

(528, 528) The size of the images accepted by the EfficientNet-B6 Classification model in Wildlife Classification.

test_split float

The percentage of images to be placed in the validation directory (A 15% validation split)

taxonomy_list list

The list of the taxonomic column names at which the taxon hierarchy will be based on.

count int

The number of sub-images placed in the correct directory.

length int

The total number of observations.

create_dataset(observations)

This method creates a DataFrame from the specified observations list.

Note, multiple dataset files can be specified and the taxonomic structure will remain correct. The method removes any Null values from the image_urls, removes the erroneous Felis Catus species, and drops unnecessary columns.

Returns:

Type Description
DataFrame

A dataframe of the specified observations file aggregated together.

Source code in dataset_structure.py
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
def create_dataset(observations: list):
    """This method creates a DataFrame from the specified observations list.

    Note, multiple dataset files can be specified and the taxonomic structure will remain correct.
    The method removes any Null values from the image_urls, removes the erroneous Felis Catus species,
    and drops unnecessary columns.

    Returns:
        (DataFrame): A dataframe of the specified observations file aggregated together.
    """
    global length
    df_obs = pd.DataFrame()

    for observation in observations:
        df_obs = pd.concat([df_obs, pd.read_csv(data_path + observation)])  # Concatenate multiple observations into one dataframe

    df_obs = df_obs.dropna(subset=['image_url'])
    df_obs = df_obs[df_obs['taxon_species_name'] != 'Felis catus']
    df_obs = df_obs.drop(columns=['observed_on', 'local_time_observed_at', 'positional_accuracy'])  # Remove unnecessary columns

    length = len(df_obs)  # Instantiate length value
    return df_obs

image_access(x)

This method creates the taxonomic path based on the specified taxonomic levels. The resulting path indicates where in the taxon directory the image will be stored.

Additionally, this method splits the images into training and validation datasets. This method is used in conjunction with the DataFrame.apply() method and the lambda expression. This method updates the status bar upon saving the sub-image into the correct directory.

Parameters:

Name Type Description Default
x Row

The row of the dataframe representing a single observation.

required
Source code in dataset_structure.py
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
def image_access(x):
    """This method creates the taxonomic path based on the specified taxonomic levels. The resulting path indicates
     where in the taxon directory the image will be stored.

     Additionally, this method splits the images into training and validation datasets.
     This method is used in conjunction with the `DataFrame.apply()` method and the lambda expression.
     This method updates the status bar upon saving the sub-image into the correct directory.

     Args:
         x (Row): The row of the dataframe representing a single observation.
     """
    global count
    path = img_path  # Path is set to the training set by default
    set_decider = random.uniform(0, 1)  # Generate a random uniform value
    if set_decider < test_split:  # If the probability is less that the test_split the image is sent to the validation set instead.
        path = test_path

    for level in taxonomy_list:  # Iterate down the taxonomic levels
        taxon_level = x[level]  # This is the taxon label at this level

        if taxon_level is np.nan:  # If the taxon level is null, then there exist no further labels below. This is the final path
            break

        taxon_level = taxon_level.replace(" ", "_")  # Clean file path (lowercase and no spaces)
        taxon_level = taxon_level.lower()
        path = path + taxon_level + "/"

    multiple_obs(x['id'], path)  # Account for multiple observations at the same file path.

    count = count + 1  # Increase the completed observation count
    status_bar_update()  # Update the status bar

multiple_obs(id, path)

This method searches for multiple sub-images per observation in order to place them at the same file path within the taxonomic directories.

This method copies the images from the images/cropped/ directory to the target file path specified.

Parameters:

Name Type Description Default
id int

The unique observation id.

required
path str

The path to the correct taxon directory within the taxonomic directory structure. This path should be the same for all sub-images of an observation.

required
Source code in dataset_structure.py
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
def multiple_obs(id, path):
    """This method searches for multiple sub-images per observation in order to place them at the same file path within the taxonomic directories.

    This method copies the images from the `images/cropped/` directory to the target file path specified.

    Args:
        id (int): The unique observation id.
        path (str): The path to the correct taxon directory within the taxonomic directory structure. This path should be the same for all sub-images of an observation.
    """
    for suffix in multiple_detections_id:  # Iterate through the possible sub-image suffixes
        raw_path = cropped_img_path + str(id) + '_' + suffix + '.jpg'  # Create a path with that suffix which could already exist within the cropped images directory
        if os.path.exists(raw_path):  # Path exists so copy file to destination path
            file_name = path + str(id) + '_' + suffix + '.jpg'
            os.makedirs(os.path.dirname(file_name), exist_ok=True)
            shutil.copy(raw_path, file_name)
        else:  # No file exists, stop searching for further sub-images
            break

status_bar_update()

This method updates the visual status bar to represent the status of the image download.

Source code in dataset_structure.py
160
161
162
163
164
165
166
167
168
169
170
def status_bar_update():
    """This method updates the visual status bar to represent the status of the image download.
    """
    progress_bar_length = 50
    percentage_complete = count / length
    filled = int(progress_bar_length * percentage_complete)  # Calculate the percentage of the bar completed.

    bar = '=' * filled + '-' * (progress_bar_length - filled)
    percentage_display = round(100 * percentage_complete, 5)  # Round percentage complete to 5 decimals for display
    sys.stdout.write('\r[%s] %s%s ... count: %s' % (bar, percentage_display, '%', count))
    sys.stdout.flush()

sub_species_detection(x)

This method performs subspecies detection based on the information available in the scientific name column.

This method identifies a subspecies name by definition as containing three distinct words making up the scientific name (unique to subspecies) This method is used in conjunction with the DataFrame.apply() method and the lambda expression.

Parameters:

Name Type Description Default
x Row

The row of the dataframe representing a single observation.

required

Returns:

Type Description
Row

The same row augmented to include an additional column titled sub_species.

Source code in dataset_structure.py
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
def sub_species_detection(x):
    """This method performs subspecies detection based on the information available in the scientific name column.

    This method identifies a subspecies name by definition as containing three distinct words making up the scientific name (unique to subspecies)
    This method is used in conjunction with the `DataFrame.apply()` method and the lambda expression.

    Args:
        x (Row): The row of the dataframe representing a single observation.

    Returns:
        (Row): The same row augmented to include an additional column titled sub_species.
    """
    name_count = len(x['scientific_name'].split())
    x['sub_species'] = np.nan
    if name_count >= 3:
        x['sub_species'] = x['scientific_name']
    return x

taxonomic_analysis(df)

This method performs a taxonomic analysis, whereby it identifies each unique label at the specified taxonomic levels.

This method in this cause can be used to visualize the potential unique taxonomic labels, but is not required in the taxon directory process.

Parameters:

Name Type Description Default
df DataFrame

The dataframe containing all observations.

required

Returns:

Type Description
dict

A dictionary whereby the keys are the specified taxon levels, and the values are a list of unique taxon labels at each level.

Source code in dataset_structure.py
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
def taxonomic_analysis(df: pd.DataFrame):
    """This method performs a taxonomic analysis, whereby it identifies each unique label at the specified taxonomic levels.

    This method in this cause can be used to visualize the potential unique taxonomic labels, but is not required in the taxon directory process.
    Args:
        df (DataFrame): The dataframe containing all observations.

    Returns:
        (dict): A dictionary whereby the keys are the specified taxon levels, and the values are a list of unique taxon labels at each level.
    """
    taxonomy_list = ['taxon_family_name', 'taxon_genus_name', 'taxon_species_name', 'sub_species']  # Specify taxonomic levels at which the analysis should occur
    taxon_breakdown = dict()  # Dictionary to contain unique values per level.

    for taxon in taxonomy_list:  # For each taxonomic level
        df = df.dropna(subset=[taxon])  # Remove all n/a labels
        taxon_breakdown[taxon] = df[taxon].unique().tolist()  # Find unique taxon level genus names
    return taxon_breakdown