#################################################### 06 Compare Models to Select the Best Hyperparameters #################################################### *Published: June, 2024, ATOM DDM Team* Please check out the companion tutorial video: |youtube-image| .. |youtube-image| image:: ../_static/img/youtube_icon.png :alt: Tutorial 06 Compare Models :target: https://www.youtube.com/watch?v=fNdSZGtZjWk ------------ This tutorial will review `AMPL `_ functions for visualizing the results of a hyperparameter search in order to find the optimal hyperparameters for your model. After performing a hyperparameter search, it is prudent to examine each hyperparameter in order to determine the best combination before training a production model with all of the data. Additionally, it is good to explore multiple performance metrics and visualize the predictions instead of relying solely on metrics. For the purposes of this tutorial, we simply ran **Tutorial 5, "Hyperparameter Optimization"**, with different parameters such as those outlined `here `_ to get enough models for comparison. Specifically, we created additional **NN** and `XGBoost `_ models as well as using **fingerprint** and **scaffold splits**. If you don't want to run that many models, you can use the ``result_df`` saved here: ``dataset/SLC6A3_models/07_example_pred_df.csv``. In this tutorial, we will focus on these functions: - `plot_train_valid_test_scores `_ - `plot_split_perf `_ - `plot_hyper_perf `_ - `get_score_types `_ - `plot_xg_perf `_ - `plot_pred_vs_actual_from_file `_ Import Packages *************** .. code:: ipython3 from atomsci.ddm.pipeline import compare_models as cm from atomsci.ddm.pipeline import hyper_perf_plots as hpp from atomsci.ddm.pipeline import perf_plots as pp import pandas as pd pd.set_option('display.max_columns', None) # ignore warnings in tutorials import warnings warnings.filterwarnings('ignore', category=FutureWarning) warnings.filterwarnings('ignore', category=RuntimeWarning) Get Model Results and Filter **************************** First we pull the results of the hyperparameter search into a dataframe. In **Tutorial 5, "Hyperparameter Optimization"**, we used ``get_filesystem_perf_results()`` which packs hyperparameters in a dict in the column ``model_parameters_dict``. Here we use the individual hyperparameter columns to create visualizations. The ``result_df`` used here as an example is the result of calling ``get_filesystem_perf_results()`` once after training several hundred models with different parameters. These models were all saved in a single folder, but the function works iteratively so it can search an entire directory tree if a parent folder is passed. .. code:: ipython3 # # call the function yourself # result_df=cm.get_filesystem_perf_results(result_dir='dataset/SLC6A3_models/', pred_type='regression') # use example results result_df=pd.read_csv('dataset/SLC6A3_models/07_example_pred_df.csv', index_col=0) result_df=result_df.sort_values('best_valid_r2_score', ascending=False) print(result_df.shape) # show useful columns result_df[['model_uuid', 'split_uuid', 'best_train_r2_score', 'best_valid_r2_score', 'best_test_r2_score']].head() .. parsed-literal:: (467, 41) .. list-table:: :header-rows: 1 :class: tight-table * - - model_uuid - split_uuid - best_train_r2_score - best_valid_r2_score - best_test_r2_score * - 310 - b24a2887-8eca-43e2-8fc2-3642189d2c94 - c35aeaab-910c-4dcf-8f9f-04b55179aa1a - 0.776865 - 0.562091 - 0.519325 * - 306 - 9b6c9332-15f3-4f96-9579-bf407d0b69a8 - c35aeaab-910c-4dcf-8f9f-04b55179aa1a - 0.950677 - 0.559590 - 0.480895 * - 291 - d8e25713-bf6c-4ba0-917a-69f0ea1472b9 - c35aeaab-910c-4dcf-8f9f-04b55179aa1a - 0.930726 - 0.557156 - 0.497192 * - 334 - 1a3e4b13-f860-4acf-8300-e2ac5766caa6 - c35aeaab-910c-4dcf-8f9f-04b55179aa1a - 0.801391 - 0.556679 - 0.523567 * - 321 - 57a62e20-f18e-4afa-8ce2-315f0fcc950a - c35aeaab-910c-4dcf-8f9f-04b55179aa1a - 0.780693 - 0.556103 - 0.521270 We can look at a brief count of models for important parameters by creating a pivot table. Here we can see `ECFP fingerprints `_ and `RDKit `_ features and **fingerprint** and **scaffold splitters** were used for each model type. A fingerprint splitter provides a more stringent test of model performance by making sure the validation and test compounds are structurally dissimilar to the training set compounds. .. code:: ipython3 # model counts model_counts=pd.DataFrame(result_df.groupby(['features','splitter','model_type'])['model_uuid'].count()).reset_index() model_counts=model_counts.pivot(index='model_type',columns=['splitter','features',], values='model_uuid') model_counts .. list-table:: :header-rows: 3 :class: tight-table * - splitter - fingerprint - scaffold - fingerprint - scaffold * - features - ecfp - ecfp - rdkit_raw - rdkit_raw * - model_type - - - - * - **NN** - 26 - 29 - 25 - 96 * - **RF** - 30 - 30 - 30 - 32 * - **xgboost** - 47 - 26 - 20 - 76 Often, certain random combinations of hyperparameters result in terribly performing models. Here we will filter those out so they don't affect the visualization by only keeping models with a validation ``r2_score`` of 0.1 or greater. .. code:: ipython3 result_df.best_valid_r2_score.describe() .. parsed-literal:: count 4.670000e+02 mean -6.111789e+73 std 1.320769e+75 min -2.854206e+76 25% -2.751967e-01 50% 2.719028e-01 75% 4.323609e-01 max 5.620908e-01 Name: best_valid_r2_score, dtype: float64 .. code:: ipython3 # filter out objectively bad performing models result_df=result_df[result_df.best_valid_r2_score>0.1] result_df.shape .. parsed-literal:: (264, 41) .. code:: ipython3 result_df.best_valid_r2_score.describe() .. parsed-literal:: count 264.000000 mean 0.405931 std 0.108515 min 0.110739 25% 0.337459 50% 0.418931 75% 0.484987 max 0.562091 Name: best_valid_r2_score, dtype: float64 After filtering out models with extremely poor metrics, we can see that some combinations don't work at all, and are completely filtered from the set. For example, decision tree based models using `RDKit `_ or `ECFP `_ features work very poorly to predict on fingerprint-split models. .. code:: ipython3 # model counts model_counts=pd.DataFrame(result_df.groupby(['features','splitter','model_type'])['model_uuid'].count()).reset_index() model_counts=model_counts.pivot(index='model_type',columns=['splitter','features',], values='model_uuid') model_counts .. list-table:: :header-rows: 3 :class: tight-table * - splitter - fingerprint - scaffold - fingerprint - scaffold * - features - ecfp - ecfp - rdkit_raw - rdkit_raw * - model_type - - - - * - **NN** - 8.0 - 23.0 - 11.0 - 86.0 * - **RF** - NaN - 30.0 - NaN - 32.0 * - **xgboost** - 3.0 - 21.0 - NaN - 50.0 Visualize Hyperparameters ************************* There are several plotting functions in the ``hyper_perf_plots`` module that help visualize the different combinations of features for each type of model. Examine overall scores ---------------------- ``plot_train_valid_test_scores()`` gives a quick snapshot of your overall model performance. You can see if you overfitted and get a sense of whether your partitions are a good representation of future performance. Because the splitter can have a drastic effect on model performance, these plots are also separated by split type. Here we see a fairly typical pattern where the training set metrics are higher than validation and test partitions. It is good to see that the validation and test scores are similar across many models, indicating that the models should generalize to new data well. For fingerprint splits, we see an odd trend where the model performs *better* on the test set than the validation set (remember - we want to minimize MAE or RMSE!), suggesting that the split is problematic since the validation set does not necessarily reflect the generalization capability of the model accurately. .. code:: ipython3 hpp.plot_train_valid_test_scores(result_df, prediction_type='regression') .. image:: ../_static/img/06_compare_models_files/06_compare_models_15_0.png Examine Splits -------------- ``plot_split_perf()`` plots the performance of each split type, separated by feature type, for each performance metric. We can see that **fingerprint splits** perform much worse than **scaffold splits** for this dataset, and but `RDKit `_ and `ECFP `_ features perform differently. `ECFP `_ features work better for scaffold splits while `RDKit `_ features work better for **fingerprint splits**. Recalling the filtering from above, we know that `RDKit `_ features for **fingerprint splits** are only represented by **NN** models, which may skew these results. .. code:: ipython3 hpp.plot_split_perf(result_df, subset='valid') .. image:: ../_static/img/06_compare_models_files/06_compare_models_17_0.png General Model Features ====================== We also want to understand general hyperparameters like model type and feature type and their effect on performance. We can use ``plot_hyper_perf()`` with ``model_type='general'`` as a shortcut to visualize these. We can see that **random forests** or **neural networks** perform the best while `ECFP `_ features perform better than **RDKit**. Additionally, the **random forest** models are very consistent while there is more variability in the **NN** and **XGBoost** model performance. .. code:: ipython3 hpp.plot_hyper_perf(result_df, model_type='general') .. image:: ../_static/img/06_compare_models_files/06_compare_models_19_0.png RF-specific Hyperparameters --------------------------- We can also use ``plot_hyper_perf()`` to visualize model-specific hyperparameters. In this case we examine random forest models because they generally perform the best for this dataset. Here, we can see two distinct sets of ``valid_r2_scores`` (probably from **fingerprint** vs **scaffold split** models), but both sets show similar trends. For ``rf_estimators`` it looks like 100-150 trees is optimal, while ``rf_max_depth`` does worse below ~15 and improves slowly after that. ``rf_max_features`` doesn't show a clear trend except that below 50 might result in worse models. .. code:: ipython3 hpp.plot_hyper_perf(result_df, model_type='RF', subset='valid', scoretype='r2_score') .. image:: ../_static/img/06_compare_models_files/06_compare_models_21_0.png We can quickly get a list of scores to plot with ``get_score_types()`` and create the same plots with different metrics. .. code:: ipython3 hpp.get_score_types() .. parsed-literal:: Classification metrics: ['roc_auc_score', 'prc_auc_score', 'precision', 'recall_score', 'npv', 'accuracy_score', 'kappa', 'matthews_cc', 'bal_accuracy'] Regression metrics: ['r2_score', 'mae_score', 'rms_score'] .. code:: ipython3 hpp.plot_hyper_perf(result_df, model_type='RF', subset='valid', scoretype='mae_score') .. image:: ../_static/img/06_compare_models_files/06_compare_models_24_0.png NN Visualization ================ When visualizing hyperparameters of NN models in this case, it is slightly hard to see important trends because there is a large variance in their model performance. To avoid this, we use ``plot_hyper_perf()`` with a subsetted dataframe to look at a single combination of splitter and features. .. list-table:: :header-rows: 1 :class: tight-table * - Plot Features - Description * - `avg_dropout` - The average of dropout proportions across all layers of the model. This parameter can affect the generalizability and overfitting of the model and usually dropout of 0.1 or higher is best. * - `learning_rate` - The learning rate during training. Generally, learning rates that are ~10e-3 do best. * - `num_weights` - The product of layer sizes plus number of nodes in first layer, a rough estimate of total model size/complexity. This parameter should be minimized by selecting the smallest layer sizes possible that still maximize the preferred metric * - `num_layers` - The number of layers in the NN, another marker of complexity. This should also be minimized. * - `best_epoch` - Which epoch had the highest performance metric during training. This can indicate problematic training if the best_epochs are very small. * - `max_epochs` - The max number of epochs the model was allowed to train (although "early stopping" may have occurred). If the max_epochs is too small you may underfit your model. This could be shown by all of your best_epochs being at max_epoch. .. code:: ipython3 subsetted=result_df[result_df.splitter=='scaffold'] subsetted=subsetted[subsetted.features=='rdkit_raw'] hpp.plot_hyper_perf(subsetted, model_type='NN') .. image:: ../_static/img/06_compare_models_files/06_compare_models_26_0.png XGBoost Visualization ===================== Using ``plot_xg_perf()``, we can simultaneously visualize the two most important parameters for `XGBoost `_ models - the learning rate and gamma. We can see that ``xgb_learning_rate`` should be between 0 and 0.45, after which the performance starts to deteriorate. There's no clear trend for ``xgb_gamma``. We can additionally use ``plot_hyper_perf()`` to visualize more `XGBoost `_ parameters, but this is not shown here. .. code:: ipython3 # hpp.plot_hyper_perf(result_df, model_type='xgboost') .. code:: ipython3 hpp.plot_xg_perf(result_df) .. image:: ../_static/img/06_compare_models_files/06_compare_models_29_0.png Evaluation of a Single Model **************************** After calling ``compare_models.get_filesystem_perf_results()``, the dataframe can be sorted according to the score you care about. The column ``model_parameters_dict`` contains hyperparameters used for the best model. We can visualize this model using ``perf_plots.plot_pred_vs_actual_from_file()``. .. note:: *Not all scores should be maximized. For example, "mae_score" or "rms_score" should be minimized instead.* .. code:: ipython3 winnertype='best_valid_r2_score' # result_df=cm.get_filesystem_perf_results(result_dir='dataset/SLC6A3_models/', pred_type='regression') result_df=pd.read_csv('dataset/SLC6A3_models/07_example_pred_df.csv', index_col=0) result_df=result_df.sort_values(winnertype, ascending=False) result_df[['model_type','features','splitter',"dropouts",'best_train_r2_score','best_valid_r2_score','best_test_r2_score','model_uuid']].head() .. list-table:: :header-rows: 1 :class: tight-table * - - model_type - features - splitter - dropouts - best_train_r2_score - best_valid_r2_score - best_test_r2_score - model_uuid * - 310 - NN - ecfp - scaffold - 0.28,0.30,0.30 - 0.776865 - 0.562091 - 0.519325 - b24a2887-8eca-43e2-8fc2-3642189d2c94 * - 306 - RF - ecfp - scaffold - NaN - 0.950677 - 0.559590 - 0.480895 - 9b6c9332-15f3-4f96-9579-bf407d0b69a8 * - 291 - RF - ecfp - scaffold - NaN - 0.930726 - 0.557156 - 0.497192 - d8e25713-bf6c-4ba0-917a-69f0ea1472b9 * - 334 - NN - ecfp - scaffold - 0.40,0.26,0.00 - 0.801391 - 0.556679 - 0.523567 - 1a3e4b13-f860-4acf-8300-e2ac5766caa6 * - 321 - NN - ecfp - scaffold - 0.39,0.05,0.10 - 0.780693 - 0.556103 - 0.521270 - 57a62e20-f18e-4afa-8ce2-315f0fcc950a We can examine important parameters of the top model directly from the ``result_df``. We see that through hyperparameter optimization, we have increased our ``best_valid_r2_score`` to 0.56, as compared to our baseline model ``valid_r2_score`` of 0.50011 (from **Tutorial 3, "Train a Simple Regression Model"**). .. code:: ipython3 result_df.iloc[0][['features','splitter','best_valid_r2_score']] .. parsed-literal:: features ecfp splitter scaffold best_valid_r2_score 0.562091 Name: 310, dtype: object .. code:: ipython3 result_df.iloc[0].model_parameters_dict .. parsed-literal:: '{"best_epoch": 24, "dropouts": [0.27866421599874197, 0.3041982566364109, 0.29943876674824], "layer_sizes": [369, 283, 146], "learning_rate": 8.28816038984145e-05, "max_epochs": 100}' .. code:: ipython3 result_df.iloc[0].model_path .. parsed-literal:: 'dataset/SLC6A3_models/SLC6A3_Ki_curated_model_b24a2887-8eca-43e2-8fc2-3642189d2c94.tar.gz' Here we use ``plot_pred_vs_actual_from_file()`` to visualize the prediction accuracy for the train, validation and test sets. .. code:: ipython3 # plot best model, an NN import importlib importlib.reload(pp) model_path=result_df.iloc[0].model_path pp.plot_pred_vs_actual_from_file(model_path) .. parsed-literal:: 2024-06-25 18:59:07,633 dataset/SLC6A3_models/SLC6A3_Ki_curated_model_b24a2887-8eca-43e2-8fc2-3642189d2c94.tar.gz, 1.6.0 2024-06-25 18:59:07,634 Version compatible check: dataset/SLC6A3_models/SLC6A3_Ki_curated_model_b24a2887-8eca-43e2-8fc2-3642189d2c94.tar.gz version = "1.6", AMPL version = "1.6" .. parsed-literal:: ['/tmp/tmph9rlkgwn/best_model/checkpoint1.pt'] /tmp/tmph9rlkgwn/best_model/checkpoint1.pt .. image:: ../_static/img/06_compare_models_files/06_compare_models_38_2.png This NN model looks like it isn't very good at predicting things with :math:`pKi` < 4.5. Additionally, there is a set of data at :math:`pKi` =5 (this data is censored and all we know is that the compounds have a :math:`pKi` < 5 because higher concentrations of drug were not tested). This data is poorly predicted by the NN model. .. note:: *Be wary of selecting models only based on their performance metrics! As we can see, this NN has problems even though the r2_score is fairly high.* .. code:: ipython3 # plot best RF model model_type='RF' model_path=result_df[result_df.model_type==model_type].iloc[0].model_path pp.plot_pred_vs_actual_from_file(model_path) print('\nBest valid r2 score: ',result_df[result_df.model_type==model_type].iloc[0].best_valid_r2_score) print('\nModel Parameters: ',result_df[result_df.model_type==model_type].iloc[0].model_parameters_dict,'\n') .. parsed-literal:: Best valid r2 score: 0.5595899501867392 Model Parameters: {"rf_estimators": 129, "rf_max_depth": 32, "rf_max_features": 95} .. image:: ../_static/img/06_compare_models_files/06_compare_models_40_2.png This RF model looks like it did better at training than the best NN model, even though its performance validation score is slightly lower. The low :math:`pKi` values are learned more accurately in the training set, and the censored data at :math:`pKi` =5 is also predicted more accurately. .. code:: ipython3 # plot best xgboost model model_type='xgboost' model_path=result_df[result_df.model_type==model_type].iloc[0].model_path pp.plot_pred_vs_actual_from_file(model_path) print('\nBest valid r2 score: ',result_df[result_df.model_type==model_type].iloc[0].best_valid_r2_score) print('\nModel Parameters: ',result_df[result_df.model_type==model_type].iloc[0].model_parameters_dict,'\n') .. parsed-literal:: Best valid r2 score: 0.5031490908520113 Model Parameters: {"xgb_colsample_bytree": 1.0, "xgb_gamma": 0.0019288871251215423, "xgb_learning_rate": 0.2158168689218416, "xgb_max_depth": 6, "xgb_min_child_weight": 1.0, "xgb_n_estimators": 100, "xgb_subsample": 1.0} .. image:: ../_static/img/06_compare_models_files/06_compare_models_42_2.png This `XGBoost `_ model learns the low :math:`pKi` values better but still suffers from problems with predicting the censored data. Moving forward, we would select the **RF** model as the best performer. In **Tutorial 7, "Train a Production Model"**, we will use the best-performing parameters to create a production model for the entire dataset. If you have specific feedback about a tutorial, please complete the `AMPL Tutorial Evaluation `_.