
Code Documentation

This file performs the binary image labelling.

The file allows you to tailor the binary image labelling to suit the use case by adapting specific variables. Please consult the README or documentation for further information.

Attributes:

Name Type Description
root_path str

The absolute path of the project root.

data_path str

The complete path (absolute + relative) to the project data directory.

observation_path str

The complete path to the data/observations/ directory. This directory contains the iNaturalist observations.

labelled_path str

The complete path to the data/labelled/ directory. This directory contains a record of the labelled observations.

labelled_file str

The name of the file in which the labelling history is recorded.

image_path str

The complete path to where the images to be labelled are found.

labelled_image_path str

The complete path to the directory where the labelled images are saved. After labelling, the images are placed in binary directories that separate the classes.

binary_labels dict

Maps the numerical key codes to the expected labels. An ignore label is also provided.

positive_count int

The number of positive labels recorded.

negative_count int

The number of negative labels recorded.

ignore_class str

Specifies a class for which labels are not recorded. Review the README or documentation for more information.
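
The key codes in binary_labels are the integers produced by a key press. A minimal sketch of such a mapping follows, assuming the '1' and '0' keys encode the positive and negative classes (the codes 49 and 48 used in status_update). The label names 'Positive' and 'Negative', the 'i' key for the ignore label, and the helper decode_key are hypothetical placeholders rather than the project's actual values.

```python
# Hypothetical mapping from key codes to labels. Only the codes 49 and 48 and
# the 'Ignore' label name appear in the documented source; everything else
# here is an illustrative assumption.
binary_labels = {
    ord('1'): 'Positive',  # ord('1') == 49
    ord('0'): 'Negative',  # ord('0') == 48
    ord('i'): 'Ignore',    # assumed key for the ignore label
}


def decode_key(encoded_key: int):
    """Return the label for a key code, or None for an unmapped key."""
    return binary_labels.get(encoded_key)


print(decode_key(49))        # Positive
print(decode_key(ord('x')))  # None, since 'x' is not mapped
```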

aggregate_datasets(datasets)

This method aggregates the specified dataset list into a single dataframe for further use.

Note, this method can be used if the user requires the images to be matched to a dataset. In the current format, the labelling process only requires the images. This method is in place to offer the capability to extend the labelling process if required.

Parameters:

Name Type Description Default
datasets list

A list of dataset file names. These will be aggregated into a single dataframe.

required

Returns:

Type Description
DataFrame

A single dataframe comprising the aggregated datasets.

Source code in main.py
def aggregate_datasets(datasets: list) -> pd.DataFrame:
    """This method aggregates the specified dataset list into a single dataframe for further use.

    Note, this method can be used if the user requires the images to be matched to a dataset.
    In the current format, the labelling process only requires the images. This method is in place to offer the
    capability to extend the labelling process if required.

    Args:
        datasets (list): A list of dataset file names. These will be aggregated into a single dataframe.

    Returns:
        (DataFrame): A single dataframe comprising the aggregated datasets.
    """
    frames = []
    for dataset in datasets:  # Iterate through the datasets
        current_df = pd.read_csv(observation_path + dataset, index_col=0)  # Read in the current dataset as a dataframe

        current_df = current_df[current_df['taxon_species_name'] != 'Felis catus']  # Apply known Felis catus restriction
        frames.append(current_df)
    return pd.concat(frames, sort=False) if frames else pd.DataFrame()  # Concatenate once; no datasets gives an empty dataframe

avoid_duplicate_images(filenames)

This method removes images that have already been processed from filenames, to avoid repeated work.

Parameters:

Name Type Description Default
filenames list

The list of all image filenames to be labelled.

required

Returns:

Type Description
list

A list of filenames, with those already labelled removed.

Source code in main.py
def avoid_duplicate_images(filenames: list) -> list:
    """This method removes images that have already been processed from filenames, to avoid repeated work.

    Args:
        filenames (list): The list of all image filenames to be labelled.

    Returns:
        (list): A list of filenames, with those already labelled removed.
    """
    if os.path.exists(labelled_path + labelled_file):
        df_labelled = pd.read_csv(labelled_path + labelled_file)  # Read in the labelled dataset
        labelled_images = set(df_labelled['id'])  # Collect the IDs in the labelled dataset
        filenames = [name for name in filenames if name not in labelled_images]  # Filter out already labelled images, keeping a list
        update_binary_counts(df_labelled)  # Update the positive and negative counts
    else:
        with open(labelled_path + labelled_file, 'w') as file:  # Create an empty file if it doesn't exist
            file.write("id,label\n")

    return filenames
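
The filtering step above can be illustrated without pandas. The sketch below assumes a history in the same `id,label` CSV layout and uses only the standard library; the function name filter_unlabelled and the sample filenames are hypothetical.

```python
import csv
import io


def filter_unlabelled(filenames, history_csv):
    """Return the filenames that do not appear in the history's id column."""
    reader = csv.DictReader(history_csv)
    labelled = {row['id'] for row in reader}  # A set gives fast membership tests
    return [name for name in filenames if name not in labelled]


# In-memory stand-in for the labelling history file (hypothetical contents).
history = io.StringIO("id,label\n1.jpg,Positive\n3.jpg,Negative\n")
remaining = filter_unlabelled(['1.jpg', '2.jpg', '3.jpg', '4.jpg'], history)
print(remaining)  # ['2.jpg', '4.jpg']
```

Returning a list rather than a lazy iterator keeps the result reusable, for example when its length is needed for progress reporting.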

copy_to_labelled_images(filename, label)

This method copies the labelled image to the corresponding directory within data/labelled/images/.

Note, the binary directories will be created automatically based on the categorical names provided in the code. In the end, the data/labelled/images/ directory will contain two additional directories housing images of each class.

Parameters:

Name Type Description Default
filename str

The name of the file to be copied into the labelled images directory.

required
label str

The corresponding categorical label of the image.

required
Source code in main.py
def copy_to_labelled_images(filename: str, label: str):
    """This method copies the labelled image to the corresponding directory within `data/labelled/images/`.

    Note, the binary directories will be created automatically based on the categorical names provided in the code.
    In the end, the `data/labelled/images/` directory will contain two additional directories housing images of each class.

    Args:
        filename (str): The name of the file to be copied into the labelled images directory.
        label (str): The corresponding categorical label of the image.
    """
    file_origin = image_path + filename
    file_destination_directory = labelled_image_path + label + "/"
    file_destination = file_destination_directory + filename

    os.makedirs(file_destination_directory, exist_ok=True)  # Make the directory if it doesn't exist

    shutil.copy(file_origin, file_destination)  # Copy the file from the origin directory into the newly specified directory.
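
As a rough illustration of the resulting directory layout, the sketch below copies a placeholder file into a per-class directory inside a temporary folder. The helper name copy_to_class_dir, the directory names, and the label 'Positive' are assumptions made for the example only.

```python
import os
import shutil
import tempfile


def copy_to_class_dir(src_dir, dst_root, filename, label):
    """Copy src_dir/filename into dst_root/label/, creating the directory if needed."""
    destination_dir = os.path.join(dst_root, label)
    os.makedirs(destination_dir, exist_ok=True)  # No error if it already exists
    return shutil.copy(os.path.join(src_dir, filename), destination_dir)


with tempfile.TemporaryDirectory() as root:
    images = os.path.join(root, 'images')
    labelled = os.path.join(root, 'labelled')
    os.makedirs(images)
    with open(os.path.join(images, '1.jpg'), 'wb') as f:
        f.write(b'placeholder bytes')  # Stand-in for real image data

    copied = copy_to_class_dir(images, labelled, '1.jpg', 'Positive')
    copied_exists = os.path.exists(copied)
    class_dirs = sorted(os.listdir(labelled))

print(class_dirs)  # ['Positive']
```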

display_image(filename)

This method displays the image specified by the filename.

The image is assumed to be located in the data/images/ directory. The image remains displayed until a key is pressed to label it.

Parameters:

Name Type Description Default
filename str

The file name of the image to be displayed.

required

Returns:

Type Description
int

An integer encoding of the key pressed.

Source code in main.py
def display_image(filename: str) -> int:
    """This method displays the image specified by the filename.

    The image is assumed to be located in the `data/images/` directory.
    The image remains displayed until a key is pressed to label it.

    Args:
        filename (str): The file name of the image to be displayed.

    Returns:
        (int): An integer encoding of the key pressed.
    """
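    img = cv2.imread(image_path + filename)  # Returns None if the file cannot be read as an image
    if img is None:
        raise FileNotFoundError('Unable to read image: ' + image_path + filename)
    cv2.imshow('Current image', img)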
    key_pressed = cv2.waitKey(0)
    return key_pressed

labelling_process()

This method controls the image labelling process.

In summary, the process is as follows: The data/images/ directory holds all of the images to be labelled. A check is conducted to ensure no images are labelled twice. Each image is labelled and then copied into the corresponding directory. A history of each image filename and its corresponding label is maintained.

As a result, there exists a labelled_file.csv containing the labelling history. Additionally, within the data/labelled/images/ directory there exist two directories housing the binary labelled images.

Source code in main.py
def labelling_process():
    """This method controls the image labelling process.

    In summary, the process is as follows:
    The `data/images/` directory holds all of the images to be labelled.
    A check is conducted to ensure no images are labelled twice.
    Each image is labelled and then copied into the corresponding directory.
    A history of each image filename and its corresponding label is maintained.

    As a result, there exists a `labelled_file.csv` containing the labelling history.
    Additionally, within the `data/labelled/images/` directory there exist two directories housing the binary labelled images.
    """
    filenames = os.listdir(image_path)  # Gather filenames from image directory. It is assumed the filenames are the IDs
    filenames = avoid_duplicate_images(filenames)  # avoid already labelled images

    labelled_files = []  # Storing the labelled files
    labels = []  # Storing the labels

    for filename in filenames:
        encoded_key = display_image(filename)

        try:
            label = binary_labels[encoded_key]  # Decode the label
        except KeyError:  # An unmapped key ends the labelling session
            write_to_file(labelled_files, labels)  # Write the labelling history on exit.
            sys.exit()

        if label != ignore_class and label != 'Ignore':  # Record the label unless it is the ignore class or the Ignore label
            labels.append(label)
            labelled_files.append(filename)

            status_update(encoded_key)  # Binary count update
            copy_to_labelled_images(filename, label)  # Copy image to data/labelled/images

    write_to_file(labelled_files, labels)  # Write the labelling history once every image has been processed.
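
The control flow of the labelling loop can be sketched without OpenCV by passing in a callable that supplies the key codes. This is a simplified illustration under stated assumptions, not the project's implementation: label_images, the key map, and the scripted key presses are all hypothetical, and the history is returned instead of being written to a file.

```python
def label_images(filenames, read_key, key_to_label, ignore_label='Ignore'):
    """Label files until they run out or an unmapped key is pressed.

    read_key is any callable taking a filename and returning a key code; here
    it stands in for display_image.
    """
    history = []  # (filename, label) pairs in labelling order
    for filename in filenames:
        label = key_to_label.get(read_key(filename))
        if label is None:  # Unmapped key: stop labelling
            break
        if label != ignore_label:  # Ignored images are not recorded
            history.append((filename, label))
    return history


# Hypothetical key map and scripted key presses, in place of a real keyboard.
key_to_label = {49: 'Positive', 48: 'Negative', 105: 'Ignore'}
scripted = {'a.jpg': 49, 'b.jpg': 105, 'c.jpg': 48, 'd.jpg': 27, 'e.jpg': 49}
history = label_images(list(scripted), scripted.get, key_to_label)
print(history)  # [('a.jpg', 'Positive'), ('c.jpg', 'Negative')]
```

In this run, b.jpg is skipped because it received the ignore label, and the unmapped code 27 on d.jpg ends the session before e.jpg is shown.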

remove_already_processed_observations(df)

This method removes the already labelled observations from the dataset.

The method accesses the already labelled dataset and extracts the unique observation IDs. It removes the IDs if they are present in the current dataset to avoid repetition.

Additionally, the method updates the positive and negative counts to keep track of the number of each binary label in the labelled dataset.

Note, this method can be used if the user requires the images to be matched to a dataset. In the current format, the labelling process only requires the images. This method is in place to offer the capability to extend the labelling process if required.

Parameters:

Name Type Description Default
df DataFrame

The current dataset that still has to be labelled.

required

Returns:

Type Description
DataFrame

The dataframe is returned with already labelled observations removed from it.

Source code in main.py
def remove_already_processed_observations(df: pd.DataFrame):
    """This method removes the already labelled observations from the dataset.

    The method accesses the already labelled dataset and extracts the unique observation IDs.
    It removes the IDs if they are present in the current dataset to avoid repetition.

    Additionally, the method updates the positive and negative counts to keep track of the number of each binary label in the
    labelled dataset.

    Note, this method can be used if the user requires the images to be matched to a dataset.
    In the current format, the labelling process only requires the images. This method is in place to offer the
    capability to extend the labelling process if required.

    Args:
        df (DataFrame): The current dataset that still has to be labelled.

    Returns:
        (DataFrame): The dataframe is returned with already labelled observations removed from it.
    """

    if os.path.exists(labelled_path + labelled_file):
        df_labelled = pd.read_csv(labelled_path + labelled_file)  # Read in the labelled dataset
        labelled_ids = df_labelled['id'].tolist()  # Generate a list of IDs in the labelled dataset
        df = df.drop(labelled_ids, errors='ignore')  # Drop rows whose index (the id) is already labelled; absent IDs are skipped.

        update_binary_counts(df_labelled)  # Update the binary counts
    else:
        with open(labelled_path + labelled_file, 'w') as file:  # Create an empty file if it doesn't exist
            file.write("id,label\n")

    return df

status_update(encoded_key)

This method updates the binary counts and displays the current counts in the terminal.

Parameters:

Name Type Description Default
encoded_key int

The encoded key value (numerical representation of the key pressed).

required
Source code in main.py
def status_update(encoded_key: int):
    """This method updates the binary counts and displays the current counts in the terminal.

    Args:
        encoded_key (int): The encoded key value (numerical representation of the key pressed).
    """
    global positive_count, negative_count

    if encoded_key == 49:  # Positive encoding (key code of '1')
        positive_count += 1
    elif encoded_key == 48:  # Negative encoding (key code of '0')
        negative_count += 1

    print(f"{binary_labels[49]} count: {positive_count}, {binary_labels[48]} count: {negative_count}")

update_binary_counts(df_labelled)

This method updates the binary counts based on the already labelled data.

The counts are held in the module-level globals positive_count and negative_count.

Parameters:

Name Type Description Default
df_labelled DataFrame

The dataframe containing the already labelled observations.

required
Source code in main.py
def update_binary_counts(df_labelled: pd.DataFrame):
    """This method updates the binary counts based on the already labelled data.

    The counts are held in the module-level globals `positive_count` and `negative_count`.

    Args:
        df_labelled (DataFrame): The dataframe containing the already labelled observations.
    """
    global positive_count, negative_count

    if not df_labelled.empty:  # Update the binary label counts if the file is not empty
        counts = df_labelled['label'].value_counts().to_dict()  # Convert counts to a dictionary

        for label in counts.keys():  # Label matching to update counts
            if label == binary_labels[49]:
                positive_count = counts[label]
            elif label == binary_labels[48]:
                negative_count = counts[label]
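
The same resumption of counts can be sketched with the standard library's Counter, which plays the role of value_counts here. The function name resume_counts and the label names are assumptions for illustration; unlike the method above, this variant returns the counts instead of updating globals.

```python
from collections import Counter


def resume_counts(labels, positive_label, negative_label):
    """Return (positive, negative) counts for previously recorded labels."""
    counts = Counter(labels)  # Labels that never occur count as 0
    return counts[positive_label], counts[negative_label]


# Hypothetical label column read from an earlier session's history.
previous = ['Positive', 'Negative', 'Positive', 'Positive']
positive_count, negative_count = resume_counts(previous, 'Positive', 'Negative')
print(positive_count, negative_count)  # 3 1
```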

write_to_file(filenames, labels)

This method appends the labelled files and their corresponding labels to the labelled_file.

Parameters:

Name Type Description Default
filenames list

A list of image filenames, in the same order as the labels list.

required
labels list

The categorical labels of the images.

required
Source code in main.py
def write_to_file(filenames: list, labels: list):
    """This method appends the labelled files and their corresponding labels to the `labelled_file`.

    Args:
        filenames (list): A list of image filenames, in the same order as the labels list.
        labels (list): The categorical labels of the images.
    """
    results_df = pd.DataFrame({'id': filenames, 'label': labels})
    results_df.to_csv(labelled_path + labelled_file, mode='a', index=False, header=False)
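
Because the history is opened in append mode, rows from successive sessions accumulate under a single header. A minimal standard-library sketch of that pattern is shown below; append_history and the file name labelled.csv are hypothetical, and the header is written here on first use, whereas the documented code creates it in avoid_duplicate_images.

```python
import csv
import os
import tempfile


def append_history(path, filenames, labels):
    """Append (id, label) rows to path, writing the header only for a new file."""
    is_new = not os.path.exists(path)
    with open(path, 'a', newline='') as file:
        writer = csv.writer(file)
        if is_new:
            writer.writerow(['id', 'label'])
        writer.writerows(zip(filenames, labels))


with tempfile.TemporaryDirectory() as root:
    path = os.path.join(root, 'labelled.csv')  # Hypothetical file name
    append_history(path, ['1.jpg'], ['Positive'])  # First session
    append_history(path, ['2.jpg'], ['Negative'])  # Later session
    with open(path, newline='') as file:
        rows = list(csv.reader(file))

print(rows)  # [['id', 'label'], ['1.jpg', 'Positive'], ['2.jpg', 'Negative']]
```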