03 Train a Simple Regression Model¶

Published: June, 2024, ATOM DDM Team

Please check out the companion tutorial video:

The process of training a machine learning (ML) model can be thought of as fitting a highly parameterized function to map inputs to outputs. An ML algorithm needs to train numerous examples of input and output pairs to accurately map an input to an output, i. e., make a prediction. After training, the result is referred to as a trained ML model or an artifact.

This tutorial will detail how we can use AMPL tools to train a regression model to predict how much a compound will inhibit the SLC6A3 protein as measured by \(pK_i\). We will train a random forest model using the following inputs:

The curated SLC6A3 dataset from Tutorial 1, “Data Curation”.
The split file generated in Tutorial 2, “Splitting Datasets for Validation and Testing”.
RDKit features calculated by the AMPL pipeline.

The tutorial will present the following functions and classes:

We will explain the use of descriptors, how to evaluate model performance, and where the model is saved as a .tar.gz file.

Note

Training a random forest model and splitting the dataset are non-deterministic. You will obtain a slightly different random forest model by running this tutorial each time.

Model Training (Using Previously Split Data)¶

In our first example, we train a model using a curated dataset (as described in Tutorial 1, “Data Curation”) that was already split using the procedure in Tutorial 2, “Splitting Datasets for Validation and Testing”. To use an existing split file, we specify its split_uuid in the model parameters and set the previously_split parameter to True. In the example code, we set split_uuid to point to a split file provided with AMPL, in case you’re running this tutorial without having previously done Tutorial 2, “Splitting Datasets for Validation and Testing”.

Here, we will use "split_uuid": "c35aeaab-910c-4dcf-8f9f-04b55179aa1a" which is saved in dataset/ as a convenience for these tutorials.

AMPL provides an extensive featurization module that can generate a variety of molecular feature types, given SMILES strings as input. For demonstration purposes, we choose to use RDKit features in this tutorial.

When the featurized dataset is not previously saved for SLC6A3_Ki, AMPL will create a featurized dataset and save it in a folder called scaled_descriptors as a csv file e.g. dataset/scaled_descriptors/SLC6A3_Ki_curated_with_rdkit_raw_descriptors.csv.

After training, AMPL saves the model and all of its parameters as a tarball in the directory given by result_dir.

# importing relevant libraries
import pandas as pd
pd.set_option('display.max_columns', None)
from atomsci.ddm.pipeline import model_pipeline as mp
from atomsci.ddm.pipeline import parameter_parser as parse

# Set up
dataset_file = 'dataset/SLC6A3_Ki_curated.csv'
odir='dataset/SLC6A3_models/'

response_col = "avg_pKi"
compound_id = "compound_id"
smiles_col = "base_rdkit_smiles"
split_uuid = "c35aeaab-910c-4dcf-8f9f-04b55179aa1a"

params = {
        "prediction_type": "regression",
        "dataset_key": dataset_file,
        "id_col": compound_id,
        "smiles_col": smiles_col,
        "response_cols": response_col,
        "previously_split": "True",
        "split_uuid" : split_uuid,
        "split_only": "False",
        "featurizer": "computed_descriptors",
        "descriptor_type" : "rdkit_raw",
        "model_type": "RF",
        "verbose": "True",
        "transformers": "True",
        "rerun": "False",
        "result_dir": odir
    }

ampl_param = parse.wrapper(params)
pl = mp.ModelPipeline(ampl_param)
pl.train_model()

Model Training (Split Data and Train)¶

AMPL also provides an option to split a dataset and train a model in one step, by setting the previously_split parameter to False and omitting the split_uuid parameter. AMPL splits the data by the type of split specified in the splitter parameter, scaffold in this example, and writes the split file in dataset/SLC6A3_Ki_curated_train_valid_test_scaffold_{split_uuid}.csv

Although it’s convenient, it is not a good idea to use the one-step option if you intend to train multiple models with different parameters on the same dataset and compare their performance. If you do, you will end up with different splits for each model, and won’t be able to tell if the differences in performance are due to the parameter settings or to the random variations between splits.

params = {
        "prediction_type": "regression",
        "dataset_key": dataset_file,
        "id_col": compound_id,
        "smiles_col": smiles_col,
        "response_cols": response_col,

        "previously_split": "False",
        "split_only": "False",
        "splitter": "scaffold",
        "split_valid_frac": "0.15",
        "split_test_frac": "0.15",

        "featurizer": "computed_descriptors",
        "descriptor_type" : "rdkit_raw",
        "model_type": "RF",
        "transformers": "True",
        "rerun": "False",
        "result_dir": odir
    }

ampl_param = parse.wrapper(params)
pl = mp.ModelPipeline(ampl_param)
pl.train_model()

Performance of the Model¶

We evaluate model performance by measuring how accurate models are on validation and test sets. The validation set is used while optimizing the model and choosing the best parameter settings. Finally, we use the model’s performance on the test set to judge the model.

AMPL has several popular metrics to evaulate regression models; Mean Absolute Error (MAE), Root Mean Squared Error (RMSE) and \(R^2\) (R-Squared). In our tutorials, we will use \(R^2\) metric to compare our models. The best model will have the highest \(R^2\) score.

# Model Performance
from atomsci.ddm.pipeline import compare_models as cm
perf_df = cm.get_filesystem_perf_results(odir, pred_type='regression')

Found data for 2 models under dataset/SLC6A3_models/

The perf_df dataframe has details about the model_uuid, model_path, ampl_version, model_type, features, splitterand the results for popular metrics that help evaluate the performance. Let us view the contents of the perf_df dataframe.

# save perf_df
import os
perf_df.to_csv(os.path.join(odir, 'perf_df.csv'))

# View the perf_df dataframe

# show most useful columns
perf_df[['model_uuid', 'split_uuid', 'best_train_r2_score', 'best_valid_r2_score', 'best_test_r2_score']]

	model_uuid	split_uuid	best_train_r2_score	best_valid_r2_score	best_test_r2_score
0	9ff5a924-ef49-407c-a4d4-868a1288a67e	c35aeaab-910c-4dcf-8f9f-04b55179aa1a	0.949835	0.500110	0.426594
1	f69409b0-33ce-404f-b1e5-0e9f5128ebc7	f6351696-363f-411a-8720-4892bc4f700e	0.949919	0.472619	0.436174

Finding the Top Performing Model¶

To pick the top performing model, we sort the performance table by best_valid_r2_score in descending order and examine the top row.

# Top performing model
top_model=perf_df.sort_values(by="best_valid_r2_score", ascending=False).iloc[0]
top_model

model_uuid                               9ff5a924-ef49-407c-a4d4-868a1288a67e
model_path                  dataset/SLC6A3_models/SLC6A3_Ki_curated_model_...
ampl_version                                                            1.6.1
model_type                                                                 RF
dataset_key                 /Users/rwilfong/Downloads/2024_LLNL/fork_ampl/...
features                                                            rdkit_raw
splitter                                                             scaffold
split_strategy                                               train_valid_test
split_uuid                               c35aeaab-910c-4dcf-8f9f-04b55179aa1a
model_score_type                                                           r2
feature_transform_type                                          normalization
weight_transform_type                                                    None
model_choice_score                                                    0.50011
best_train_r2_score                                                  0.949835
best_train_rms_score                                                  0.27884
best_train_mae_score                                                 0.198072
best_train_num_compounds                                                 1273
best_valid_r2_score                                                   0.50011
best_valid_rms_score                                                 0.854443
best_valid_mae_score                                                 0.700053
best_valid_num_compounds                                                  273
best_test_r2_score                                                   0.426594
best_test_rms_score                                                   0.92241
best_test_mae_score                                                  0.746781
best_test_num_compounds                                                   273
rf_estimators                                                             500
rf_max_features                                                            32
rf_max_depth                                                             None
max_epochs                                                                NaN
best_epoch                                                                NaN
learning_rate                                                             NaN
layer_sizes                                                               NaN
dropouts                                                                  NaN
xgb_gamma                                                                 NaN
xgb_learning_rate                                                         NaN
xgb_max_depth                                                             NaN
xgb_colsample_bytree                                                      NaN
xgb_subsample                                                             NaN
xgb_n_estimators                                                          NaN
xgb_min_child_weight                                                      NaN
model_parameters_dict       {"rf_estimators": 500, "rf_max_depth": null, "...
feat_parameters_dict                                                       {}
Name: 0, dtype: object

You can find the path to the .tar.gz file (“tarball”) where the top performing model is saved by examining top_model.model_path. You will need this path to run predictions with the model at a later time.

# Top performing model path
top_model.model_path

'dataset/SLC6A3_models/SLC6A3_Ki_curated_model_9ff5a924-ef49-407c-a4d4-868a1288a67e.tar.gz'

In Tutorial 4 , “Application of a Trained Model”, we will learn how to use a selected model to make predictions and evaluate those predictions

If you have specific feedback about a tutorial, please complete the AMPL Tutorial Evaluation.