06 Compare Models to Select the Best Hyperparameters¶
Published: June, 2024, ATOM DDM Team
Please check out the companion tutorial video: ![]()
This tutorial will review AMPL functions for visualizing the results of a hyperparameter search in order to find the optimal hyperparameters for your model.
After performing a hyperparameter search, it is prudent to examine each hyperparameter in order to determine the best combination before training a production model with all of the data. Additionally, it is good to explore multiple performance metrics and visualize the predictions instead of relying solely on metrics.
For the purposes of this tutorial, we simply ran Tutorial 5,
“Hyperparameter Optimization”, with different parameters such as those
outlined
here
to get enough models for comparison. Specifically, we created additional
NN and XGBoost
models as well as using fingerprint and scaffold splits. If you
don’t want to run that many models, you can use the result_df saved
here: dataset/SLC6A3_models/07_example_pred_df.csv.
In this tutorial, we will focus on these functions:
Import Packages¶
from atomsci.ddm.pipeline import compare_models as cm
from atomsci.ddm.pipeline import hyper_perf_plots as hpp
from atomsci.ddm.pipeline import perf_plots as pp
import pandas as pd
pd.set_option('display.max_columns', None)
# ignore warnings in tutorials
import warnings
warnings.filterwarnings('ignore', category=FutureWarning)
warnings.filterwarnings('ignore', category=RuntimeWarning)
Get Model Results and Filter¶
First we pull the results of the hyperparameter search into a dataframe.
In Tutorial 5, “Hyperparameter Optimization”, we used
get_filesystem_perf_results() which packs hyperparameters in a dict
in the column model_parameters_dict. Here we use the individual
hyperparameter columns to create visualizations.
The result_df used here as an example is the result of calling
get_filesystem_perf_results() once after training several hundred
models with different parameters. These models were all saved in a
single folder, but the function works iteratively so it can search an
entire directory tree if a parent folder is passed.
# # call the function yourself
# result_df=cm.get_filesystem_perf_results(result_dir='dataset/SLC6A3_models/', pred_type='regression')
# use example results
result_df=pd.read_csv('dataset/SLC6A3_models/07_example_pred_df.csv', index_col=0)
result_df=result_df.sort_values('best_valid_r2_score', ascending=False)
print(result_df.shape)
# show useful columns
result_df[['model_uuid', 'split_uuid', 'best_train_r2_score', 'best_valid_r2_score', 'best_test_r2_score']].head()
(467, 41)
model_uuid |
split_uuid |
best_train_r2_score |
best_valid_r2_score |
best_test_r2_score |
|
|---|---|---|---|---|---|
310 |
b24a2887-8eca-43e2-8fc2-3642189d2c94 |
c35aeaab-910c-4dcf-8f9f-04b55179aa1a |
0.776865 |
0.562091 |
0.519325 |
306 |
9b6c9332-15f3-4f96-9579-bf407d0b69a8 |
c35aeaab-910c-4dcf-8f9f-04b55179aa1a |
0.950677 |
0.559590 |
0.480895 |
291 |
d8e25713-bf6c-4ba0-917a-69f0ea1472b9 |
c35aeaab-910c-4dcf-8f9f-04b55179aa1a |
0.930726 |
0.557156 |
0.497192 |
334 |
1a3e4b13-f860-4acf-8300-e2ac5766caa6 |
c35aeaab-910c-4dcf-8f9f-04b55179aa1a |
0.801391 |
0.556679 |
0.523567 |
321 |
57a62e20-f18e-4afa-8ce2-315f0fcc950a |
c35aeaab-910c-4dcf-8f9f-04b55179aa1a |
0.780693 |
0.556103 |
0.521270 |
We can look at a brief count of models for important parameters by creating a pivot table. Here we can see ECFP fingerprints and RDKit features and fingerprint and scaffold splitters were used for each model type. A fingerprint splitter provides a more stringent test of model performance by making sure the validation and test compounds are structurally dissimilar to the training set compounds.
# model counts
model_counts=pd.DataFrame(result_df.groupby(['features','splitter','model_type'])['model_uuid'].count()).reset_index()
model_counts=model_counts.pivot(index='model_type',columns=['splitter','features',], values='model_uuid')
model_counts
splitter |
fingerprint |
scaffold |
fingerprint |
scaffold |
|---|---|---|---|---|
features |
ecfp |
ecfp |
rdkit_raw |
rdkit_raw |
model_type |
||||
NN |
26 |
29 |
25 |
96 |
RF |
30 |
30 |
30 |
32 |
xgboost |
47 |
26 |
20 |
76 |
Often, certain random combinations of hyperparameters result in terribly
performing models. Here we will filter those out so they don’t affect
the visualization by only keeping models with a validation r2_score
of 0.1 or greater.
result_df.best_valid_r2_score.describe()
count 4.670000e+02
mean -6.111789e+73
std 1.320769e+75
min -2.854206e+76
25% -2.751967e-01
50% 2.719028e-01
75% 4.323609e-01
max 5.620908e-01
Name: best_valid_r2_score, dtype: float64
# filter out objectively bad performing models
result_df=result_df[result_df.best_valid_r2_score>0.1]
result_df.shape
(264, 41)
result_df.best_valid_r2_score.describe()
count 264.000000
mean 0.405931
std 0.108515
min 0.110739
25% 0.337459
50% 0.418931
75% 0.484987
max 0.562091
Name: best_valid_r2_score, dtype: float64
After filtering out models with extremely poor metrics, we can see that some combinations don’t work at all, and are completely filtered from the set. For example, decision tree based models using RDKit or ECFP features work very poorly to predict on fingerprint-split models.
# model counts
model_counts=pd.DataFrame(result_df.groupby(['features','splitter','model_type'])['model_uuid'].count()).reset_index()
model_counts=model_counts.pivot(index='model_type',columns=['splitter','features',], values='model_uuid')
model_counts
splitter |
fingerprint |
scaffold |
fingerprint |
scaffold |
|---|---|---|---|---|
features |
ecfp |
ecfp |
rdkit_raw |
rdkit_raw |
model_type |
||||
NN |
8.0 |
23.0 |
11.0 |
86.0 |
RF |
NaN |
30.0 |
NaN |
32.0 |
xgboost |
3.0 |
21.0 |
NaN |
50.0 |
Visualize Hyperparameters¶
There are several plotting functions in the hyper_perf_plots module
that help visualize the different combinations of features for each type
of model.
Examine overall scores¶
plot_train_valid_test_scores() gives a quick snapshot of your
overall model performance. You can see if you overfitted and get a sense
of whether your partitions are a good representation of future
performance. Because the splitter can have a drastic effect on model
performance, these plots are also separated by split type.
Here we see a fairly typical pattern where the training set metrics are higher than validation and test partitions. It is good to see that the validation and test scores are similar across many models, indicating that the models should generalize to new data well. For fingerprint splits, we see an odd trend where the model performs better on the test set than the validation set (remember - we want to minimize MAE or RMSE!), suggesting that the split is problematic since the validation set does not necessarily reflect the generalization capability of the model accurately.
hpp.plot_train_valid_test_scores(result_df, prediction_type='regression')
Examine Splits¶
plot_split_perf() plots the performance of each split type,
separated by feature type, for each performance metric.
We can see that fingerprint splits perform much worse than scaffold splits for this dataset, and but RDKit and ECFP features perform differently. ECFP features work better for scaffold splits while RDKit features work better for fingerprint splits. Recalling the filtering from above, we know that RDKit features for fingerprint splits are only represented by NN models, which may skew these results.
hpp.plot_split_perf(result_df, subset='valid')
General Model Features¶
We also want to understand general hyperparameters like model type and
feature type and their effect on performance. We can use
plot_hyper_perf() with model_type='general' as a shortcut to
visualize these.
We can see that random forests or neural networks perform the best while ECFP features perform better than RDKit. Additionally, the random forest models are very consistent while there is more variability in the NN and XGBoost model performance.
hpp.plot_hyper_perf(result_df, model_type='general')
RF-specific Hyperparameters¶
We can also use plot_hyper_perf() to visualize model-specific
hyperparameters. In this case we examine random forest models because
they generally perform the best for this dataset.
Here, we can see two distinct sets of valid_r2_scores (probably from
fingerprint vs scaffold split models), but both sets show
similar trends. For rf_estimators it looks like 100-150 trees is
optimal, while rf_max_depth does worse below ~15 and improves slowly
after that. rf_max_features doesn’t show a clear trend except that
below 50 might result in worse models.
hpp.plot_hyper_perf(result_df, model_type='RF', subset='valid', scoretype='r2_score')
We can quickly get a list of scores to plot with get_score_types()
and create the same plots with different metrics.
hpp.get_score_types()
Classification metrics: ['roc_auc_score', 'prc_auc_score', 'precision', 'recall_score', 'npv', 'accuracy_score', 'kappa', 'matthews_cc', 'bal_accuracy']
Regression metrics: ['r2_score', 'mae_score', 'rms_score']
hpp.plot_hyper_perf(result_df, model_type='RF', subset='valid', scoretype='mae_score')
NN Visualization¶
When visualizing hyperparameters of NN models in this case, it is
slightly hard to see important trends because there is a large variance
in their model performance. To avoid this, we use plot_hyper_perf()
with a subsetted dataframe to look at a single combination of splitter
and features.
Plot Features |
Description |
|---|---|
avg_dropout |
The average of dropout proportions across all layers of the model. This parameter can affect the generalizability and overfitting of the model and usually dropout of 0.1 or higher is best. |
learning_rate |
The learning rate during training. Generally, learning rates that are ~10e-3 do best. |
num_weights |
The product of layer sizes plus number of nodes in first layer, a rough estimate of total model size/complexity. This parameter should be minimized by selecting the smallest layer sizes possible that still maximize the preferred metric |
num_layers |
The number of layers in the NN, another marker of complexity. This should also be minimized. |
best_epoch |
Which epoch had the highest performance metric during training. This can indicate problematic training if the best_epochs are very small. |
max_epochs |
The max number of epochs the model was allowed to train (although “early stopping” may have occurred). If the max_epochs is too small you may underfit your model. This could be shown by all of your best_epochs being at max_epoch. |
subsetted=result_df[result_df.splitter=='scaffold']
subsetted=subsetted[subsetted.features=='rdkit_raw']
hpp.plot_hyper_perf(subsetted, model_type='NN')
XGBoost Visualization¶
Using plot_xg_perf(), we can simultaneously visualize the two most
important parameters for
XGBoost models - the
learning rate and gamma. We can see that xgb_learning_rate should be
between 0 and 0.45, after which the performance starts to deteriorate.
There’s no clear trend for xgb_gamma. We can additionally use
plot_hyper_perf() to visualize more
XGBoost parameters, but
this is not shown here.
# hpp.plot_hyper_perf(result_df, model_type='xgboost')
hpp.plot_xg_perf(result_df)
Evaluation of a Single Model¶
After calling compare_models.get_filesystem_perf_results(), the
dataframe can be sorted according to the score you care about. The
column model_parameters_dict contains hyperparameters used for the
best model. We can visualize this model using
perf_plots.plot_pred_vs_actual_from_file().
Note
Not all scores should be maximized. For example, “mae_score” or “rms_score” should be minimized instead.
winnertype='best_valid_r2_score'
# result_df=cm.get_filesystem_perf_results(result_dir='dataset/SLC6A3_models/', pred_type='regression')
result_df=pd.read_csv('dataset/SLC6A3_models/07_example_pred_df.csv', index_col=0)
result_df=result_df.sort_values(winnertype, ascending=False)
result_df[['model_type','features','splitter',"dropouts",'best_train_r2_score','best_valid_r2_score','best_test_r2_score','model_uuid']].head()
model_type |
features |
splitter |
dropouts |
best_train_r2_score |
best_valid_r2_score |
best_test_r2_score |
model_uuid |
|
|---|---|---|---|---|---|---|---|---|
310 |
NN |
ecfp |
scaffold |
0.28,0.30,0.30 |
0.776865 |
0.562091 |
0.519325 |
b24a2887-8eca-43e2-8fc2-3642189d2c94 |
306 |
RF |
ecfp |
scaffold |
NaN |
0.950677 |
0.559590 |
0.480895 |
9b6c9332-15f3-4f96-9579-bf407d0b69a8 |
291 |
RF |
ecfp |
scaffold |
NaN |
0.930726 |
0.557156 |
0.497192 |
d8e25713-bf6c-4ba0-917a-69f0ea1472b9 |
334 |
NN |
ecfp |
scaffold |
0.40,0.26,0.00 |
0.801391 |
0.556679 |
0.523567 |
1a3e4b13-f860-4acf-8300-e2ac5766caa6 |
321 |
NN |
ecfp |
scaffold |
0.39,0.05,0.10 |
0.780693 |
0.556103 |
0.521270 |
57a62e20-f18e-4afa-8ce2-315f0fcc950a |
We can examine important parameters of the top model directly from the
result_df.
We see that through hyperparameter optimization, we have increased our
best_valid_r2_score to 0.56, as compared to our baseline model
valid_r2_score of 0.50011 (from Tutorial 3, “Train a Simple
Regression Model”).
result_df.iloc[0][['features','splitter','best_valid_r2_score']]
features ecfp
splitter scaffold
best_valid_r2_score 0.562091
Name: 310, dtype: object
result_df.iloc[0].model_parameters_dict
'{"best_epoch": 24, "dropouts": [0.27866421599874197, 0.3041982566364109, 0.29943876674824], "layer_sizes": [369, 283, 146], "learning_rate": 8.28816038984145e-05, "max_epochs": 100}'
result_df.iloc[0].model_path
'dataset/SLC6A3_models/SLC6A3_Ki_curated_model_b24a2887-8eca-43e2-8fc2-3642189d2c94.tar.gz'
Here we use plot_pred_vs_actual_from_file() to visualize the
prediction accuracy for the train, validation and test sets.
# plot best model, an NN
import importlib
importlib.reload(pp)
model_path=result_df.iloc[0].model_path
pp.plot_pred_vs_actual_from_file(model_path)
2024-06-25 18:59:07,633 dataset/SLC6A3_models/SLC6A3_Ki_curated_model_b24a2887-8eca-43e2-8fc2-3642189d2c94.tar.gz, 1.6.0
2024-06-25 18:59:07,634 Version compatible check: dataset/SLC6A3_models/SLC6A3_Ki_curated_model_b24a2887-8eca-43e2-8fc2-3642189d2c94.tar.gz version = "1.6", AMPL version = "1.6"
['/tmp/tmph9rlkgwn/best_model/checkpoint1.pt']
/tmp/tmph9rlkgwn/best_model/checkpoint1.pt
This NN model looks like it isn’t very good at predicting things with \(pKi\) < 4.5. Additionally, there is a set of data at \(pKi\) =5 (this data is censored and all we know is that the compounds have a \(pKi\) < 5 because higher concentrations of drug were not tested). This data is poorly predicted by the NN model.
Note
Be wary of selecting models only based on their performance metrics! As we can see, this NN has problems even though the r2_score is fairly high.
# plot best RF model
model_type='RF'
model_path=result_df[result_df.model_type==model_type].iloc[0].model_path
pp.plot_pred_vs_actual_from_file(model_path)
print('\nBest valid r2 score: ',result_df[result_df.model_type==model_type].iloc[0].best_valid_r2_score)
print('\nModel Parameters: ',result_df[result_df.model_type==model_type].iloc[0].model_parameters_dict,'\n')
Best valid r2 score: 0.5595899501867392
Model Parameters: {"rf_estimators": 129, "rf_max_depth": 32, "rf_max_features": 95}
This RF model looks like it did better at training than the best NN model, even though its performance validation score is slightly lower. The low \(pKi\) values are learned more accurately in the training set, and the censored data at \(pKi\) =5 is also predicted more accurately.
# plot best xgboost model
model_type='xgboost'
model_path=result_df[result_df.model_type==model_type].iloc[0].model_path
pp.plot_pred_vs_actual_from_file(model_path)
print('\nBest valid r2 score: ',result_df[result_df.model_type==model_type].iloc[0].best_valid_r2_score)
print('\nModel Parameters: ',result_df[result_df.model_type==model_type].iloc[0].model_parameters_dict,'\n')
Best valid r2 score: 0.5031490908520113
Model Parameters: {"xgb_colsample_bytree": 1.0, "xgb_gamma": 0.0019288871251215423, "xgb_learning_rate": 0.2158168689218416, "xgb_max_depth": 6, "xgb_min_child_weight": 1.0, "xgb_n_estimators": 100, "xgb_subsample": 1.0}
This XGBoost model learns the low \(pKi\) values better but still suffers from problems with predicting the censored data.
Moving forward, we would select the RF model as the best performer.
In Tutorial 7, “Train a Production Model”, we will use the best-performing parameters to create a production model for the entire dataset.
If you have specific feedback about a tutorial, please complete the AMPL Tutorial Evaluation.