06 Compare Models to Select the Best Hyperparameters

Published: June, 2024, ATOM DDM Team

Please check out the companion tutorial video: Tutorial 06 Compare Models


This tutorial will review AMPL functions for visualizing the results of a hyperparameter search in order to find the optimal hyperparameters for your model.

After performing a hyperparameter search, it is prudent to examine each hyperparameter in order to determine the best combination before training a production model with all of the data. Additionally, it is good to explore multiple performance metrics and visualize the predictions instead of relying solely on metrics.

For the purposes of this tutorial, we simply ran Tutorial 5, “Hyperparameter Optimization”, with different parameters such as those outlined here to get enough models for comparison. Specifically, we created additional NN and XGBoost models as well as using fingerprint and scaffold splits. If you don’t want to run that many models, you can use the result_df saved here: dataset/SLC6A3_models/07_example_pred_df.csv.

In this tutorial, we will focus on these functions:

Import Packages

from atomsci.ddm.pipeline import compare_models as cm
from atomsci.ddm.pipeline import hyper_perf_plots as hpp
from atomsci.ddm.pipeline import perf_plots as pp
import pandas as pd
pd.set_option('display.max_columns', None)

# ignore warnings in tutorials
import warnings
warnings.filterwarnings('ignore', category=FutureWarning)
warnings.filterwarnings('ignore', category=RuntimeWarning)

Get Model Results and Filter

First we pull the results of the hyperparameter search into a dataframe. In Tutorial 5, “Hyperparameter Optimization”, we used get_filesystem_perf_results() which packs hyperparameters in a dict in the column model_parameters_dict. Here we use the individual hyperparameter columns to create visualizations.

The result_df used here as an example is the result of calling get_filesystem_perf_results() once after training several hundred models with different parameters. These models were all saved in a single folder, but the function works iteratively so it can search an entire directory tree if a parent folder is passed.

# # call the function yourself
# result_df=cm.get_filesystem_perf_results(result_dir='dataset/SLC6A3_models/', pred_type='regression')

# use example results
result_df=pd.read_csv('dataset/SLC6A3_models/07_example_pred_df.csv', index_col=0)

result_df=result_df.sort_values('best_valid_r2_score', ascending=False)
print(result_df.shape)

# show useful columns
result_df[['model_uuid', 'split_uuid', 'best_train_r2_score', 'best_valid_r2_score', 'best_test_r2_score']].head()
(467, 41)

model_uuid

split_uuid

best_train_r2_score

best_valid_r2_score

best_test_r2_score

310

b24a2887-8eca-43e2-8fc2-3642189d2c94

c35aeaab-910c-4dcf-8f9f-04b55179aa1a

0.776865

0.562091

0.519325

306

9b6c9332-15f3-4f96-9579-bf407d0b69a8

c35aeaab-910c-4dcf-8f9f-04b55179aa1a

0.950677

0.559590

0.480895

291

d8e25713-bf6c-4ba0-917a-69f0ea1472b9

c35aeaab-910c-4dcf-8f9f-04b55179aa1a

0.930726

0.557156

0.497192

334

1a3e4b13-f860-4acf-8300-e2ac5766caa6

c35aeaab-910c-4dcf-8f9f-04b55179aa1a

0.801391

0.556679

0.523567

321

57a62e20-f18e-4afa-8ce2-315f0fcc950a

c35aeaab-910c-4dcf-8f9f-04b55179aa1a

0.780693

0.556103

0.521270

We can look at a brief count of models for important parameters by creating a pivot table. Here we can see ECFP fingerprints and RDKit features and fingerprint and scaffold splitters were used for each model type. A fingerprint splitter provides a more stringent test of model performance by making sure the validation and test compounds are structurally dissimilar to the training set compounds.

# model counts
model_counts=pd.DataFrame(result_df.groupby(['features','splitter','model_type'])['model_uuid'].count()).reset_index()
model_counts=model_counts.pivot(index='model_type',columns=['splitter','features',], values='model_uuid')
model_counts

splitter

fingerprint

scaffold

fingerprint

scaffold

features

ecfp

ecfp

rdkit_raw

rdkit_raw

model_type

NN

26

29

25

96

RF

30

30

30

32

xgboost

47

26

20

76

Often, certain random combinations of hyperparameters result in terribly performing models. Here we will filter those out so they don’t affect the visualization by only keeping models with a validation r2_score of 0.1 or greater.

result_df.best_valid_r2_score.describe()
count    4.670000e+02
mean    -6.111789e+73
std      1.320769e+75
min     -2.854206e+76
25%     -2.751967e-01
50%      2.719028e-01
75%      4.323609e-01
max      5.620908e-01
Name: best_valid_r2_score, dtype: float64
# filter out objectively bad performing models
result_df=result_df[result_df.best_valid_r2_score>0.1]
result_df.shape
(264, 41)
result_df.best_valid_r2_score.describe()
count    264.000000
mean       0.405931
std        0.108515
min        0.110739
25%        0.337459
50%        0.418931
75%        0.484987
max        0.562091
Name: best_valid_r2_score, dtype: float64

After filtering out models with extremely poor metrics, we can see that some combinations don’t work at all, and are completely filtered from the set. For example, decision tree based models using RDKit or ECFP features work very poorly to predict on fingerprint-split models.

#  model counts
model_counts=pd.DataFrame(result_df.groupby(['features','splitter','model_type'])['model_uuid'].count()).reset_index()
model_counts=model_counts.pivot(index='model_type',columns=['splitter','features',], values='model_uuid')
model_counts

splitter

fingerprint

scaffold

fingerprint

scaffold

features

ecfp

ecfp

rdkit_raw

rdkit_raw

model_type

NN

8.0

23.0

11.0

86.0

RF

NaN

30.0

NaN

32.0

xgboost

3.0

21.0

NaN

50.0

Visualize Hyperparameters

There are several plotting functions in the hyper_perf_plots module that help visualize the different combinations of features for each type of model.

Examine overall scores

plot_train_valid_test_scores() gives a quick snapshot of your overall model performance. You can see if you overfitted and get a sense of whether your partitions are a good representation of future performance. Because the splitter can have a drastic effect on model performance, these plots are also separated by split type.

Here we see a fairly typical pattern where the training set metrics are higher than validation and test partitions. It is good to see that the validation and test scores are similar across many models, indicating that the models should generalize to new data well. For fingerprint splits, we see an odd trend where the model performs better on the test set than the validation set (remember - we want to minimize MAE or RMSE!), suggesting that the split is problematic since the validation set does not necessarily reflect the generalization capability of the model accurately.

hpp.plot_train_valid_test_scores(result_df, prediction_type='regression')
../_images/06_compare_models_15_0.png

Examine Splits

plot_split_perf() plots the performance of each split type, separated by feature type, for each performance metric.

We can see that fingerprint splits perform much worse than scaffold splits for this dataset, and but RDKit and ECFP features perform differently. ECFP features work better for scaffold splits while RDKit features work better for fingerprint splits. Recalling the filtering from above, we know that RDKit features for fingerprint splits are only represented by NN models, which may skew these results.

hpp.plot_split_perf(result_df, subset='valid')
../_images/06_compare_models_17_0.png

General Model Features

We also want to understand general hyperparameters like model type and feature type and their effect on performance. We can use plot_hyper_perf() with model_type='general' as a shortcut to visualize these.

We can see that random forests or neural networks perform the best while ECFP features perform better than RDKit. Additionally, the random forest models are very consistent while there is more variability in the NN and XGBoost model performance.

hpp.plot_hyper_perf(result_df, model_type='general')
../_images/06_compare_models_19_0.png

RF-specific Hyperparameters

We can also use plot_hyper_perf() to visualize model-specific hyperparameters. In this case we examine random forest models because they generally perform the best for this dataset.

Here, we can see two distinct sets of valid_r2_scores (probably from fingerprint vs scaffold split models), but both sets show similar trends. For rf_estimators it looks like 100-150 trees is optimal, while rf_max_depth does worse below ~15 and improves slowly after that. rf_max_features doesn’t show a clear trend except that below 50 might result in worse models.

hpp.plot_hyper_perf(result_df, model_type='RF', subset='valid', scoretype='r2_score')
../_images/06_compare_models_21_0.png

We can quickly get a list of scores to plot with get_score_types() and create the same plots with different metrics.

hpp.get_score_types()
Classification metrics:  ['roc_auc_score', 'prc_auc_score', 'precision', 'recall_score', 'npv', 'accuracy_score', 'kappa', 'matthews_cc', 'bal_accuracy']
Regression metrics:  ['r2_score', 'mae_score', 'rms_score']
hpp.plot_hyper_perf(result_df, model_type='RF', subset='valid', scoretype='mae_score')
../_images/06_compare_models_24_0.png

NN Visualization

When visualizing hyperparameters of NN models in this case, it is slightly hard to see important trends because there is a large variance in their model performance. To avoid this, we use plot_hyper_perf() with a subsetted dataframe to look at a single combination of splitter and features.

Plot Features

Description

avg_dropout

The average of dropout proportions across all layers of the model. This parameter can affect the generalizability and overfitting of the model and usually dropout of 0.1 or higher is best.

learning_rate

The learning rate during training. Generally, learning rates that are ~10e-3 do best.

num_weights

The product of layer sizes plus number of nodes in first layer, a rough estimate of total model size/complexity. This parameter should be minimized by selecting the smallest layer sizes possible that still maximize the preferred metric

num_layers

The number of layers in the NN, another marker of complexity. This should also be minimized.

best_epoch

Which epoch had the highest performance metric during training. This can indicate problematic training if the best_epochs are very small.

max_epochs

The max number of epochs the model was allowed to train (although “early stopping” may have occurred). If the max_epochs is too small you may underfit your model. This could be shown by all of your best_epochs being at max_epoch.

subsetted=result_df[result_df.splitter=='scaffold']
subsetted=subsetted[subsetted.features=='rdkit_raw']
hpp.plot_hyper_perf(subsetted, model_type='NN')
../_images/06_compare_models_26_0.png

XGBoost Visualization

Using plot_xg_perf(), we can simultaneously visualize the two most important parameters for XGBoost models - the learning rate and gamma. We can see that xgb_learning_rate should be between 0 and 0.45, after which the performance starts to deteriorate. There’s no clear trend for xgb_gamma. We can additionally use plot_hyper_perf() to visualize more XGBoost parameters, but this is not shown here.

# hpp.plot_hyper_perf(result_df, model_type='xgboost')
hpp.plot_xg_perf(result_df)
../_images/06_compare_models_29_0.png

Evaluation of a Single Model

After calling compare_models.get_filesystem_perf_results(), the dataframe can be sorted according to the score you care about. The column model_parameters_dict contains hyperparameters used for the best model. We can visualize this model using perf_plots.plot_pred_vs_actual_from_file().

Note

Not all scores should be maximized. For example, “mae_score” or “rms_score” should be minimized instead.

winnertype='best_valid_r2_score'

# result_df=cm.get_filesystem_perf_results(result_dir='dataset/SLC6A3_models/', pred_type='regression')

result_df=pd.read_csv('dataset/SLC6A3_models/07_example_pred_df.csv', index_col=0)
result_df=result_df.sort_values(winnertype, ascending=False)
result_df[['model_type','features','splitter',"dropouts",'best_train_r2_score','best_valid_r2_score','best_test_r2_score','model_uuid']].head()

model_type

features

splitter

dropouts

best_train_r2_score

best_valid_r2_score

best_test_r2_score

model_uuid

310

NN

ecfp

scaffold

0.28,0.30,0.30

0.776865

0.562091

0.519325

b24a2887-8eca-43e2-8fc2-3642189d2c94

306

RF

ecfp

scaffold

NaN

0.950677

0.559590

0.480895

9b6c9332-15f3-4f96-9579-bf407d0b69a8

291

RF

ecfp

scaffold

NaN

0.930726

0.557156

0.497192

d8e25713-bf6c-4ba0-917a-69f0ea1472b9

334

NN

ecfp

scaffold

0.40,0.26,0.00

0.801391

0.556679

0.523567

1a3e4b13-f860-4acf-8300-e2ac5766caa6

321

NN

ecfp

scaffold

0.39,0.05,0.10

0.780693

0.556103

0.521270

57a62e20-f18e-4afa-8ce2-315f0fcc950a

We can examine important parameters of the top model directly from the result_df.

We see that through hyperparameter optimization, we have increased our best_valid_r2_score to 0.56, as compared to our baseline model valid_r2_score of 0.50011 (from Tutorial 3, “Train a Simple Regression Model”).

result_df.iloc[0][['features','splitter','best_valid_r2_score']]
features                   ecfp
splitter               scaffold
best_valid_r2_score    0.562091
Name: 310, dtype: object
result_df.iloc[0].model_parameters_dict
'{"best_epoch": 24, "dropouts": [0.27866421599874197, 0.3041982566364109, 0.29943876674824], "layer_sizes": [369, 283, 146], "learning_rate": 8.28816038984145e-05, "max_epochs": 100}'
result_df.iloc[0].model_path
'dataset/SLC6A3_models/SLC6A3_Ki_curated_model_b24a2887-8eca-43e2-8fc2-3642189d2c94.tar.gz'

Here we use plot_pred_vs_actual_from_file() to visualize the prediction accuracy for the train, validation and test sets.

# plot best model, an NN
import importlib
importlib.reload(pp)
model_path=result_df.iloc[0].model_path
pp.plot_pred_vs_actual_from_file(model_path)
2024-06-25 18:59:07,633 dataset/SLC6A3_models/SLC6A3_Ki_curated_model_b24a2887-8eca-43e2-8fc2-3642189d2c94.tar.gz, 1.6.0
2024-06-25 18:59:07,634 Version compatible check: dataset/SLC6A3_models/SLC6A3_Ki_curated_model_b24a2887-8eca-43e2-8fc2-3642189d2c94.tar.gz version = "1.6", AMPL version = "1.6"
['/tmp/tmph9rlkgwn/best_model/checkpoint1.pt']
/tmp/tmph9rlkgwn/best_model/checkpoint1.pt
../_images/06_compare_models_38_2.png

This NN model looks like it isn’t very good at predicting things with \(pKi\) < 4.5. Additionally, there is a set of data at \(pKi\) =5 (this data is censored and all we know is that the compounds have a \(pKi\) < 5 because higher concentrations of drug were not tested). This data is poorly predicted by the NN model.

Note

Be wary of selecting models only based on their performance metrics! As we can see, this NN has problems even though the r2_score is fairly high.

# plot best RF model
model_type='RF'
model_path=result_df[result_df.model_type==model_type].iloc[0].model_path
pp.plot_pred_vs_actual_from_file(model_path)
print('\nBest valid r2 score: ',result_df[result_df.model_type==model_type].iloc[0].best_valid_r2_score)
print('\nModel Parameters: ',result_df[result_df.model_type==model_type].iloc[0].model_parameters_dict,'\n')
Best valid r2 score:  0.5595899501867392

Model Parameters:  {"rf_estimators": 129, "rf_max_depth": 32, "rf_max_features": 95}
../_images/06_compare_models_40_2.png

This RF model looks like it did better at training than the best NN model, even though its performance validation score is slightly lower. The low \(pKi\) values are learned more accurately in the training set, and the censored data at \(pKi\) =5 is also predicted more accurately.

# plot best xgboost model
model_type='xgboost'
model_path=result_df[result_df.model_type==model_type].iloc[0].model_path
pp.plot_pred_vs_actual_from_file(model_path)
print('\nBest valid r2 score: ',result_df[result_df.model_type==model_type].iloc[0].best_valid_r2_score)
print('\nModel Parameters: ',result_df[result_df.model_type==model_type].iloc[0].model_parameters_dict,'\n')
Best valid r2 score:  0.5031490908520113

Model Parameters:  {"xgb_colsample_bytree": 1.0, "xgb_gamma": 0.0019288871251215423, "xgb_learning_rate": 0.2158168689218416, "xgb_max_depth": 6, "xgb_min_child_weight": 1.0, "xgb_n_estimators": 100, "xgb_subsample": 1.0}
../_images/06_compare_models_42_2.png

This XGBoost model learns the low \(pKi\) values better but still suffers from problems with predicting the censored data.

Moving forward, we would select the RF model as the best performer.

In Tutorial 7, “Train a Production Model”, we will use the best-performing parameters to create a production model for the entire dataset.

If you have specific feedback about a tutorial, please complete the AMPL Tutorial Evaluation.