05 Hyperparameter Optimization

Published: June, 2024, ATOM DDM Team

Please check out the companion tutorial video: Tutorial 05 Hyperparameter Optimization


Hyperparameters control the training process and the architecture of the model itself. For example, the number of trees is a hyperparameter of a random forest. In contrast, the learned parameters of a random forest include the features used at each node of each tree and the cutoff values that determine how the data is split at that node. A full discussion of hyperparameter optimization can be found on Wikipedia.
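To make the distinction concrete, here is a toy sketch of a single-split regression "tree" (a decision stump). It is only an illustration and is not how AMPL or any random forest library is implemented; the function names and the `n_thresholds` setting are hypothetical. The value of `n_thresholds` is chosen before training (a hyperparameter), while the feature index and cutoff are learned from the data.

```python
def sum_squared_error(values):
    """Squared error of the values around their own mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def fit_stump(rows, targets, n_thresholds):
    """Return the (feature_index, cutoff) of the best single split.

    n_thresholds is a hyperparameter: the number of evenly spaced
    candidate cutoffs tried for each feature.
    """
    best = None  # (error, feature_index, cutoff)
    for j in range(len(rows[0])):
        column = [row[j] for row in rows]
        lo, hi = min(column), max(column)
        for k in range(1, n_thresholds + 1):
            cutoff = lo + (hi - lo) * k / (n_thresholds + 1)
            left = [t for x, t in zip(column, targets) if x <= cutoff]
            right = [t for x, t in zip(column, targets) if x > cutoff]
            if not left or not right:
                continue  # skip splits that leave one side empty
            error = sum_squared_error(left) + sum_squared_error(right)
            if best is None or error < best[0]:
                best = (error, j, cutoff)
    if best is None:
        raise ValueError("no valid split found")
    return best[1], best[2]


# Tiny made-up dataset: feature 0 separates the targets, feature 1 does not.
rows = [[1.0, 5.0], [2.0, 3.0], [3.0, 4.0], [4.0, 6.0]]
targets = [0.0, 0.0, 1.0, 1.0]

learned_feature, learned_cutoff = fit_stump(rows, targets, n_thresholds=3)
print(learned_feature, learned_cutoff)
```

Changing `n_thresholds` changes which cutoffs the training procedure is allowed to consider; the training data then determines which feature and cutoff are actually selected.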

The choice of hyperparameters strongly influences model performance, so it is important to optimize them. AMPL offers a variety of hyperparameter optimization methods, including random sampling, grid search, and Bayesian optimization. Please refer to the parameter documentation page for further information.
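As a rough sketch of how two of these strategies differ, the snippet below generates candidate settings by grid search and by random sampling. The search space and helper names are hypothetical and are not AMPL defaults or AMPL code. Bayesian optimization, which is not shown here, differs in that it uses the scores of earlier trials to decide which settings to try next.

```python
import itertools
import random

# Hypothetical search space for illustration only.
space = {
    "n_estimators": [8, 64, 512],
    "max_depth": [6, 16, 32],
}


def grid_search_points(space):
    """Enumerate every combination of the listed values."""
    names = list(space)
    combos = itertools.product(*(space[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]


def random_search_points(bounds, n_trials, seed=0):
    """Draw n_trials independent integer settings within (low, high) bounds."""
    rng = random.Random(seed)
    return [
        {name: rng.randint(low, high) for name, (low, high) in bounds.items()}
        for _ in range(n_trials)
    ]


grid_points = grid_search_points(space)

# Reuse the smallest and largest grid values as bounds for random sampling.
bounds = {name: (min(values), max(values)) for name, values in space.items()}
random_points = random_search_points(bounds, n_trials=5)

print(len(grid_points), "grid points;", len(random_points), "random points")
```

The grid grows multiplicatively with each added hyperparameter, whereas the number of random trials is fixed by the caller.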

In this tutorial we demonstrate the following:

  • Build a parameter dictionary to perform a hyperparameter search for a random forest using Bayesian optimization.

  • Perform the optimization process.

  • Review the results.

We will use these AMPL functions:

The first three functions in the above list come from the hyperparameter_search_wrapper module.

Set Up Directories

Here we set up a few important variables corresponding to required directories and specific features for the hyperparameter optimization (HPO) process. Then, we ensure that the directories are created before saving models into them.

Variable          Description
dataset_key       The relative path to the dataset you want to use for HPO
descriptor_type   The type of features you want to use during HPO
model_dir         The directory where you want to save all of the models
best_model_dir    For Bayesian optimization, the winning model is saved in this separate folder
split_uuid        The presaved split uuid from Tutorial 3, “Splitting Datasets for Validation and Testing”

import warnings
warnings.filterwarnings('ignore', category=FutureWarning)
warnings.filterwarnings('ignore', category=RuntimeWarning)

import os

dataset_key='dataset/SLC6A3_Ki_curated.csv'
descriptor_type = 'rdkit_raw'
model_dir = 'dataset/SLC6A3_models'
best_model_dir = 'dataset/SLC6A3_models/best_models'
split_uuid = "c35aeaab-910c-4dcf-8f9f-04b55179aa1a"


# Create the output directories. best_model_dir is nested inside model_dir,
# so use makedirs, which also creates any missing parent directories.
os.makedirs(f'./{model_dir}', exist_ok=True)
os.makedirs(f'./{best_model_dir}', exist_ok=True)

To run a hyperparameter search, we first create a parameter dictionary with parameter settings that will be common to all models, along with some special parameters that control the search and indicate which parameters will be varied and how. The table below describes the special parameter settings for our random forest search.

Parameter Dictionary Settings

Parameter                         Description
"hyperparam": "True"              This setting indicates that we are performing a hyperparameter search instead of just training one model.
"previously_featurized": "True"   This tells AMPL to search for previously generated features in ../dataset/scaled_descriptors instead of regenerating them on the fly.
"search_type": "hyperopt"         This specifies the hyperparameter search method. Other options include grid, random, and geometric. The specifications differ for each search method; please refer to the full documentation. Here we are using the Bayesian optimization method.
"model_type": "RF|10"             This means AMPL will try 10 times to find the best set of hyperparameters using random forests. In practice, this parameter could be set to 100 or more.
"rfe": "uniformint|8,512"         The Bayesian optimizer will uniformly search between 8 and 512 for the best number of random forest estimators. Similarly, rfd stands for random forest depth and rff stands for random forest features.
"result_dir"                      Expects two comma-separated directories. The first directory will contain the best trained models, while the second directory will contain all models trained in the search.

Regression models are optimized to maximize \(R^2\), and classification models are optimized to maximize the area under the receiver operating characteristic curve (ROC AUC). A full list of parameters can be found on our GitHub.
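For reference, the snippet below computes \(R^2\) using the common definition \(R^2 = 1 - SS_{res}/SS_{tot}\). It is a minimal sketch with made-up numbers and a hypothetical function name; it does not show which data subset AMPL scores or how AMPL computes the metric internally.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up response values for illustration.
y_true = [5.0, 6.0, 7.0, 8.0]

perfect_score = r_squared(y_true, y_true)       # predictions match exactly
baseline_score = r_squared(y_true, [6.5] * 4)   # always predict the mean

print(perfect_score, baseline_score)
```

A model that always predicts the mean of the observed values scores 0, a perfect model scores 1, and a model worse than the mean predictor scores below 0.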

params = {
    "hyperparam": "True",
    "prediction_type": "regression",

    "dataset_key": dataset_key,
    "id_col": "compound_id",
    "smiles_col": "base_rdkit_smiles",
    "response_cols": "avg_pKi",

    "splitter":"scaffold",
    "split_uuid": split_uuid,
    "previously_split": "True",

    "featurizer": "computed_descriptors",
    "descriptor_type" : descriptor_type,
    "transformers": "True",

    "search_type": "hyperopt",
    "model_type": "RF|10",
    "rfe": "uniformint|8,512",
    "rfd": "uniformint|6,32",
    "rff": "uniformint|8,200",

    "result_dir": f"./{best_model_dir},./{model_dir}"
}
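The search settings in the dictionary above pack several values into one string: `model_type` joins a model name and a trial count with `|`, the range settings join a distribution name and two bounds, and `result_dir` joins two directories with a comma. The helpers below are hypothetical and only illustrate how to read these strings; they are not AMPL's own parser.

```python
def parse_model_type(spec):
    """Split a string such as 'RF|10' into (model name, number of trials)."""
    name, _, trials = spec.partition("|")
    return name, int(trials)


def parse_range(spec):
    """Split a string such as 'uniformint|8,512' into (distribution, low, high)."""
    distribution, _, bounds = spec.partition("|")
    low, high = (float(value) for value in bounds.split(","))
    return distribution, low, high


# Values copied from the random forest dictionary in this tutorial.
search_settings = {
    "model_type": "RF|10",
    "rfe": "uniformint|8,512",
    "rfd": "uniformint|6,32",
    "rff": "uniformint|8,200",
    "result_dir": "./dataset/SLC6A3_models/best_models,./dataset/SLC6A3_models",
}

model_name, n_trials = parse_model_type(search_settings["model_type"])
ranges = {key: parse_range(search_settings[key]) for key in ("rfe", "rfd", "rff")}
best_dir, all_models_dir = search_settings["result_dir"].split(",")

print(model_name, n_trials)
for key, (distribution, low, high) in ranges.items():
    print(f"{key}: {distribution} between {low:g} and {high:g}")
print("best models:", best_dir)
print("all models:", all_models_dir)
```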

Examples of Other Parameter Sets

Below are some parameters that can be used for neural networks, XGBoost models, fingerprint splits and ECFP features. Each set of parameters can be used to replace the parameters above. Trying them out is left as an exercise for the reader.

XGBoost

  • xgbg stands for xgb_gamma and sets the search domain for the minimum loss reduction required to make a further partition on a leaf node of the tree.

  • xgbl stands for xgb_learning_rate and sets the search domain for the boosting learning rate of XGBoost models.

params = {
    "hyperparam": "True",
    "prediction_type": "regression",

    "dataset_key": dataset_key,
    "id_col": "compound_id",
    "smiles_col": "base_rdkit_smiles",
    "response_cols": "avg_pKi",

    "splitter":"scaffold",
    "split_uuid": split_uuid,
    "previously_split": "True",

    "featurizer": "computed_descriptors",
    "descriptor_type" : descriptor_type,
    "transformers": "True",

    ### Use an XGBoost model
    "search_type": "hyperopt",
    "model_type": "xgboost|10",
    "xgbg": "uniform|0,0.2",
    "xgbl": "loguniform|-2,2",
    ###

    "result_dir": f"./{best_model_dir},./{model_dir}"
}
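The `loguniform` setting deserves a note. In the hyperopt library, a log-uniform draw is defined as exp(uniform(low, high)), so the two numbers are bounds on the natural logarithm of the value rather than on the value itself. The sketch below assumes AMPL passes the bounds through to hyperopt unchanged, which is an assumption to check against the parameter documentation; it mimics the draw with the standard library and does not call hyperopt or AMPL.

```python
import math
import random


def sample_loguniform(low, high, rng):
    """Draw exp(u) with u uniform in [low, high], as in a log-uniform prior."""
    return math.exp(rng.uniform(low, high))


rng = random.Random(0)
samples = [sample_loguniform(-2, 2, rng) for _ in range(1000)]

lower_bound = math.exp(-2)  # about 0.135
upper_bound = math.exp(2)   # about 7.39

print(f"range of values: {lower_bound:.3f} to {upper_bound:.3f}")
print(f"smallest draw: {min(samples):.3f}, largest draw: {max(samples):.3f}")
```

Under this interpretation, "loguniform|-2,2" covers values from roughly 0.135 to 7.39, with equal probability per unit of the logarithm.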

Fingerprint Split

This trains an XGBoost model using a provided fingerprint split.

fp_split_uuid="be60c264-6ac0-4841-a6b6-41bf846e4ae4"

params = {
    "hyperparam": "True",
    "prediction_type": "regression",

    "dataset_key": dataset_key,
    "id_col": "compound_id",
    "smiles_col": "base_rdkit_smiles",
    "response_cols": "avg_pKi",

    ### Use a fingerprint split
    "splitter":"fingerprint",
    "split_uuid": fp_split_uuid,
    "previously_split": "True",
    ###

    "featurizer": "computed_descriptors",
    "descriptor_type" : descriptor_type,
    "transformers": "True",

    "search_type": "hyperopt",
    "model_type": "xgboost|10",
    "xgbg": "uniform|0,0.2",
    "xgbl": "loguniform|-2,2",

    "result_dir": f"./{best_model_dir},./{model_dir}"
}

ECFP Features

This uses an XGBoost model with ECFP fingerprint features and a scaffold split.

params = {
    "hyperparam": "True",
    "prediction_type": "regression",

    "dataset_key": dataset_key,
    "id_col": "compound_id",
    "smiles_col": "base_rdkit_smiles",
    "response_cols": "avg_pKi",

    "splitter":"scaffold",
    "split_uuid": split_uuid,
    "previously_split": "True",

    ### Use ECFP Features
    "featurizer": "ecfp",
    "ecfp_radius" : 2,
    "ecfp_size" : 1024,
    "transformers": "True",
    ###

    "search_type": "hyperopt",
    "model_type": "xgboost|10",
    "xgbg": "uniform|0,0.2",
    "xgbl": "loguniform|-2,2",

    "result_dir": f"./{best_model_dir},./{model_dir}"
}
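The three alternative dictionaries above differ from the random forest dictionary in only a few keys. One way to avoid repeating the shared settings is to derive each variant from a base dictionary, as sketched below. The `make_variant` helper is hypothetical and not part of AMPL; the values are copied from this tutorial.

```python
# Base settings copied from the random forest dictionary in this tutorial.
base_params = {
    "hyperparam": "True",
    "prediction_type": "regression",
    "dataset_key": "dataset/SLC6A3_Ki_curated.csv",
    "id_col": "compound_id",
    "smiles_col": "base_rdkit_smiles",
    "response_cols": "avg_pKi",
    "splitter": "scaffold",
    "split_uuid": "c35aeaab-910c-4dcf-8f9f-04b55179aa1a",
    "previously_split": "True",
    "featurizer": "computed_descriptors",
    "descriptor_type": "rdkit_raw",
    "transformers": "True",
    "search_type": "hyperopt",
    "model_type": "RF|10",
    "rfe": "uniformint|8,512",
    "rfd": "uniformint|6,32",
    "rff": "uniformint|8,200",
    "result_dir": "./dataset/SLC6A3_models/best_models,./dataset/SLC6A3_models",
}


def make_variant(base, overrides, drop=()):
    """Copy base, remove the keys in drop, then apply overrides."""
    variant = {key: value for key, value in base.items() if key not in drop}
    variant.update(overrides)
    return variant


# XGBoost: replace the random forest search settings.
xgb_params = make_variant(
    base_params,
    {"model_type": "xgboost|10", "xgbg": "uniform|0,0.2", "xgbl": "loguniform|-2,2"},
    drop=("rfe", "rfd", "rff"),
)

# Fingerprint split: same XGBoost search, different presaved split.
fp_params = make_variant(
    xgb_params,
    {"splitter": "fingerprint", "split_uuid": "be60c264-6ac0-4841-a6b6-41bf846e4ae4"},
)

# ECFP features: same XGBoost search, different featurizer.
ecfp_params = make_variant(
    xgb_params,
    {"featurizer": "ecfp", "ecfp_radius": 2, "ecfp_size": 1024},
    drop=("descriptor_type",),
)
```

Because `make_variant` returns a new dictionary each time, `base_params` stays unchanged and can be reused for further variants.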

In Tutorial 6, “Compare Models to Select the Best Hyperparameters”, we analyze the performance of these large sets of models to select the best hyperparameters for production models.

If you have specific feedback about a tutorial, please complete the AMPL Tutorial Evaluation.