Decision Tree Model

This file creates and trains the decision tree metadata classification model.

The decision tree metadata classification model performs hyperparameter tuning over the depth of the decision tree. The training process uses 5-fold cross-validation to evaluate the model's performance for each hyperparameter value. A best-model save policy is enforced using the mean balanced accuracy across the 5 folds.
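
Balanced accuracy is the mean of the per-class recall values, which makes it robust to the class imbalance typical of taxonomic observation data. A minimal sketch of the metric using scikit-learn (the toy labels below are illustrative only):

from sklearn.metrics import balanced_accuracy_score

# Toy labels: class "a" dominates, so plain accuracy would look deceptively high.
y_true = ["a", "a", "a", "a", "b", "b"]
y_pred = ["a", "a", "a", "a", "b", "a"]

# Per-class recall: "a" = 4/4 = 1.0, "b" = 1/2 = 0.5
# Balanced accuracy = (1.0 + 0.5) / 2 = 0.75
print(balanced_accuracy_score(y_true, y_pred))  # 0.75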

Attributes:

    root_path (str): The path to the project root.
    data_destination (str): The path to where the decision tree model and its training accuracy are saved.

decision_tree_process(df, taxon_target, model_name, score_file, validation_file)

This method specifies the decision tree modelling process.

Specifically, this method calls the required pipeline (the decision tree pipeline) to generate the features and labels required for training, then calls the training process on that data.

Parameters:

    df (DataFrame, required): The dataframe containing all data for each observation.
    taxon_target (str, required): The taxonomic target level, used to extract the correct labels (taxon_family_name, taxon_genus_name, taxon_species_name, subspecies).
    model_name (str, required): The name of the model type being trained. In this case 'Decision tree'.
    score_file (str, required): The filename where the training scores will be stored, in the same location as the saved model.
    validation_file (str, required): The name of the file where the validation data will be stored. Also informs the names of the saved models.
Source code in src/models/meta/decision_tree.py
def decision_tree_process(df: pd.DataFrame, taxon_target: str, model_name: str, score_file: str, validation_file: str):
    """This method specifies the decision tree modelling process.

    Specifically, this method calls the required pipeline (the decision tree pipeline) to generate the features and labels required for training,
    then calls the training process on that data.

    Args:
        df (DataFrame): The dataframe containing all data for each observation.
        taxon_target (str): The taxonomic target level, used to extract the correct labels (taxon_family_name, taxon_genus_name, taxon_species_name, subspecies).
        model_name (str): The name of the model type being trained. In this case 'Decision tree'.
        score_file (str): The filename where the training scores will be stored, in the same location as the saved model.
        validation_file (str): The name of the file where the validation data will be stored. Also informs the names of the saved models.
    """
    X, y = pipelines.decision_tree_data(df, taxon_target, validation_file)  # Processing and formatting
    train_decision_tree(X, y, model_name, score_file)  # Training and evaluation
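
A hypothetical invocation of this entry point; the CSV path, target level, and filenames below are assumptions for illustration, not values defined by this module:

import pandas as pd

df = pd.read_csv('data/observations.csv')  # Hypothetical observations file

decision_tree_process(df,
                      taxon_target='taxon_species_name',  # One of the supported taxonomic levels
                      model_name='decision_tree_species.sav',  # Illustrative model filename
                      score_file='decision_tree_species_scores.csv',
                      validation_file='decision_tree_species_validation.csv')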

train_decision_tree(X, y, model_name, score_file)

This method performs the decision tree training and hyperparameter tuning.

Hyperparameter tuning aims to determine the optimal decision tree depth to be used in each classification model. This process uses a best-model save policy based on the mean balanced accuracy evaluation metric.
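
As the source below shows, each candidate model is also trained with balanced class weights from scikit-learn's compute_class_weight, which weights each class inversely to its frequency: n_samples / (n_classes * class_count). A small illustration with toy labels:

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y = np.array(['a', 'a', 'a', 'b'])  # Toy labels: 'a' occurs 3 times, 'b' once
classes = np.unique(y)
weight_values = compute_class_weight(class_weight='balanced', classes=classes, y=y)
print(dict(zip(classes, weight_values)))
# {'a': 0.667, 'b': 2.0} -- i.e. 4 / (2 * 3) and 4 / (2 * 1)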

Parameters:

    X (DataFrame, required): The input features to the decision tree.
    y (Series, required): The categorical taxonomic labels corresponding to the feature observations.
    model_name (str, required): The name of the model type being trained. In this case 'Decision tree'.
    score_file (str, required): The filename where the training scores will be stored.
Source code in src/models/meta/decision_tree.py
def train_decision_tree(X, y, model_name: str, score_file: str):
    """This method performs the decision tree training and hyperparameter tuning.

    Hyperparameter tuning aims to determine the optimal decision tree depth to be used in each classification model.
    This process uses a best-model save policy based on the mean balanced accuracy evaluation metric.

    Args:
        X (DataFrame): The input features to the decision tree.
        y (Series): The categorical taxonomic labels corresponding to the feature observations.
        model_name (str): The name of the model type being trained. In this case 'Decision tree'.
        score_file (str): The filename of where the training data will be stored.
    """
    depth_limit = len(X.columns)  # Use the number of input features as the maximum tree depth
    depth_range = range(1, depth_limit, 2)  # Generate the candidate depths using an interval of 2
    best_accuracy = 0  # Best mean balanced accuracy observed so far
    scores = []  # Mean score holder, one entry per candidate depth

    classes = np.unique(y)  # Weight the training by the presence of each class
    weight_values = compute_class_weight(class_weight='balanced', classes=classes, y=y)
    weights = dict(zip(classes, weight_values))  # Zip the calculated weights to the classes to form a dict

    for depth in depth_range:  # Iterate through the candidate decision tree depths
        clf = DecisionTreeClassifier(max_depth=depth, random_state=0, class_weight=weights)  # Create the model
        score = cross_val_score(estimator=clf,
                                X=X,
                                y=y,
                                cv=5,
                                n_jobs=-1,
                                scoring='balanced_accuracy')  # Evaluate the model using 5-fold cross-validation

        score_mean = np.mean(score)  # Average the fold scores
        scores.append(score_mean)  # Save the mean score
        print(f"Depth {depth} of {depth_limit}: mean balanced accuracy {score_mean}")

        if best_accuracy < score_mean:  # Best-model save policy
            clf.fit(X.values, y)  # Refit on all data so the new top-performer can be saved
            filename = root_path + data_destination + model_name
            best_accuracy = score_mean
            pickle.dump(clf, open(filename, 'wb'))  # Save the model

    write_scores_to_file(scores, [*depth_range], score_file)  # Write the mean scores to file
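
Because the best model is pickled to disk, it can be reloaded later for inference. A minimal sketch, assuming the same root_path, data_destination, and model_name used at training time, and a feature dataframe X matching the training features:

import pickle

with open(root_path + data_destination + model_name, 'rb') as f:  # Same path used when saving
    clf = pickle.load(f)

predictions = clf.predict(X.values)  # The model was fit on X.values, so pass the raw array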

write_scores_to_file(mean_scores, depth_range, filename)

This method writes the model's mean training and evaluation scores to a CSV file for visualization and record-keeping.

Note that the data written includes the mean balanced accuracy values calculated during cross-validation.

Parameters:

    mean_scores (list, required): The list of mean accuracy scores, one for each trained model.
    depth_range (list, required): The list of decision tree depths trained over; each element has a corresponding mean accuracy in mean_scores.
    filename (str, required): The filename where the training scores will be saved.
Source code in src/models/meta/decision_tree.py
def write_scores_to_file(mean_scores: list, depth_range: list, filename: str):
    """This method writes the model mean training and evaluation scores to a csv file for visualization and records.

    Note, the data written includes the mean training balanced accuracy value calculated in the cross validation.

    Args:
        mean_scores (list): The list of mean accuracy scores for each model.
        depth_range (list):: The list containing the decision tree depths trained over. For each element, there is a corresponding mean accuracy in mean_scores.
        filename (str): The filename, where the training data will be saved.
    """
    df = pd.DataFrame({'depth': depth_range, 'mean_scores': mean_scores})
    df.to_csv(root_path + data_destination + filename, index=False)
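
The resulting CSV can then be inspected or plotted to see how tree depth affects performance. A short sketch, assuming matplotlib is available; the filename here is illustrative, so use whatever score_file was passed to train_decision_tree:

import pandas as pd
import matplotlib.pyplot as plt

scores_df = pd.read_csv(root_path + data_destination + 'decision_tree_scores.csv')

best = scores_df.loc[scores_df['mean_scores'].idxmax()]  # Row with the highest mean balanced accuracy
print(f"Best depth: {best['depth']} (balanced accuracy {best['mean_scores']:.3f})")

plt.plot(scores_df['depth'], scores_df['mean_scores'], marker='o')
plt.xlabel('Tree depth')
plt.ylabel('Mean balanced accuracy (5-fold CV)')
plt.show()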