07 Train a Production Model

Published: June, 2024, ATOM DDM Team

Please check out the companion tutorial video: Tutorial 07 Train Production Model


In this tutorial, you will use the existing hyperparameters that yielded the best performing model in the previous Tutorial 6, “Compare models to select the best hyperparameters” to train a production model on the full dataset, rather than just the training subset from a scaffold split. Training a model on the full dataset is one approach for creating a production model. Another approach, not demonstrated in this tutorial, is to combine the training and validation subsets for training and evaluating with the test subset. K-fold cross validation makes use of this approach. The production model could be shared with other researchers to predict on new data.

We will use the functions below to retrain the best original model for production and compare the production model with the original model. We use the word “retrain” here to describe creating a new model based on the same hyperparameters; retraining doesn’t replace the existing model.

We will use functions covered in Tutorial 3, “Train a Simple Regression Model” to evaluate the original and production models on an external test dataset.

We will focus on these functions in this tutorial:

Note

1. “train_model_from_tar” and other functions in the “model_retrain” module can be used to update a previously trained model when there is a new AMPL release that is not compatible with previous versions. This is not covered in this tutorial.

2. When a model input dataset is updated with additional data, the model should be trained from scratch with a new hyperparameter optimization run; then a new version of the production model can be generated.

Import Packages

import pandas as pd
import os
from atomsci.ddm.utils import model_retrain as mr
from atomsci.ddm.utils import model_version_utils as mv
from atomsci.ddm.pipeline import model_pipeline as mp
from atomsci.ddm.pipeline import predict_from_model as pfm
from atomsci.ddm.pipeline import perf_plots as pp
from sklearn.metrics import r2_score
import numpy as np
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings('ignore', category=FutureWarning)

Start From Saved Best Model

We’re using the best performing model by validation set \(R^2\) from Tutorial 6, “Compare models to select the best hyperparameters”. This was a random forest model. The model path, validation set \(R^2\) and RF-specific parameters were as follows:

Parameter

Value

Model path

dataset/SLC6A3_models/SLC6A3_Ki_curated_model_9b6c9332-15f3-4f96-9579-bf407d0b69a8.tar.gz

Best valid r2 score

0.5595899501867392

Model Parameters

{“rf_estimators”: 129, “rf_max_depth”: 32, “rf_max_features”: 95}

When you load a previously trained model, either for running predictions or retraining, the parameters used to train the model are obtained from a file called ‘model_metadata.json’ that is stored in the model’s .tar.gz file. The production parameter controls whether the data is split into training, validation and test sets. During normal training and hyperparameter optimization, the production parameter is set False; setting it True causes all splitting related parameters to be ignored. When you call train_model_from_tar with the production argument set True, the code overrides the production parameter in the model parameters to set it True, so that all the data is used for training.

In the following code, create_prediction_pipeline_from_file is used to reload the original best model. We then verify that it was trained with production set to False.

We also check the AMPL version to make sure the saved model was trained with a version of AMPL that is compatible with the installed version. The versions must be compatible in order to be used by create_prediction_pipeline_from_file, which we will be calling later to run predictions from the saved model.

#Tutorial 7 RF saved model_path
best_model_path='dataset/SLC6A3_models/SLC6A3_Ki_curated_model_9b6c9332-15f3-4f96-9579-bf407d0b69a8.tar.gz'

#get installed AMPL version
print("installed AMPL version: " + str(mv.get_ampl_version()))

#get AMPL model version
print("best model AMPL version: " + str(mv.get_ampl_version_from_model(best_model_path)))

# check versions are compatible
assert(mv.check_version_compatible(best_model_path))

#load best model production params
best_model_pipe = mp.create_prediction_pipeline_from_file(params=None, reload_dir=None, model_path=best_model_path, model_type='best_model', featurization=None, verbose=False)

#show production is false
print("orig_params.production: " + str(best_model_pipe.orig_params.production))
installed AMPL version: 1.6.1
best model AMPL version: 1.6.0
orig_params.production: False

Retrain Best Model as Production Model

Setting the production argument for train_model_from_tar to True will set production=True in the model params. If the production argument for train_model_from_tar is False, the model will be retrained without changing any parameters. Note the production model’s model parameter production is set to True.

odir='dataset/SLC6A3_models'

production_model = mr.train_model_from_tar(input=best_model_path, output=odir, production=True)

#check for parameters
print("production_model.params.production: " + str(production_model.params.production))
print("production_model.params.model_tarball_path: " + str(production_model.params.model_tarball_path))

print("production model AMPL version: " + str(mv.get_ampl_version_from_model(production_model.params.model_tarball_path)))
production_model.params.production: True
production_model.params.model_tarball_path: dataset/SLC6A3_models/SLC6A3_Ki_curated_model_4f266a2a-64ce-46fd-9c43-d2f14720f788.tar.gz
production model AMPL version: 1.6.1

Compare Performance on a Separate External Test Dataset

Here we will apply Tutorial 4, “Application of a Trained Model”’s steps to run predictions with the original best model and the production model, using an independent dataset of compounds that are structurally different (with Tanimoto distance > 0.4) from all compounds in the production dataset. We use this approach to compare the performance of the two models. Since the production model is trained on all data, including the test subset, the original test subset should not be used to evaluate its performance.

First we’ll load the external test dataset, which we’ve already featurized with RDKit descriptors:

test_file_path = 'dataset/scaled_descriptors/SLC6A3_Ki_ext_test_data_with_rdkit_raw_descriptors.csv'
test_data = pd.read_csv(test_file_path)

# show most important columns
test_data[['compound_id', 'base_rdkit_smiles', 'avg_pKi']].head()

compound_id

base_rdkit_smiles

avg_pKi

0

compound_346

OC(C[NH2+]C1CCC1)C1(c2ccc(Cl)c(Cl)c2)CCC1

7.958607

1

compound_225

CN1Cc2ccccc2C(C)(c2ccc3[nH]ncc3c2)C1

6.587660

2

compound_166

O=C(O)C(/C=C/c1ccccc1)C1CCN(CCOC(c2ccccc2)c2cc…

5.430275

3

compound_310

CN1Cc2cc(-c3cccnn3)ccc2C(C)(c2cc3ccccc3[nH]2)C1

6.000000

4

compound_284

CN1Cc2ccccc2C(F)(c2ccc3sccc3c2)C1

6.587660

We now predict \(pK_i\) values with the original best model:

id_col = 'compound_id'
smiles_col = 'base_rdkit_smiles'
response_col = 'avg_pKi'

best_pred_df = pfm.predict_from_model_file(model_path = best_model_path,
                                      input_df = test_data,
                                      id_col = id_col ,
                                      smiles_col = smiles_col,
                                      response_col = response_col,
                                      is_featurized=False) #throws error if is_featurized=True

# show most important columns
best_pred_df[['compound_id', 'base_rdkit_smiles', 'avg_pKi', 'avg_pKi_actual', 'avg_pKi_pred', 'avg_pKi_std']].head()
Standardizing SMILES strings for 533 compounds.

compound_id

base_rdkit_smiles

avg_pKi

avg_pKi_actual

avg_pKi_pred

avg_pKi_std

0

compound_346

OC(C[NH2+]C1CCC1)C1(c2ccc(Cl)c(Cl)c2)CCC1

7.958607

7.958607

7.284956

0.955853

1

compound_225

CN1Cc2ccccc2C(C)(c2ccc3[nH]ncc3c2)C1

6.587660

6.587660

7.143886

0.801133

2

compound_166

O=C(O)C(/C=C/c1ccccc1)C1CCN(CCOC(c2ccccc2)c2cc…

5.430275

5.430275

7.676473

0.947577

3

compound_310

CN1Cc2cc(-c3cccnn3)ccc2C(C)(c2cc3ccccc3[nH]2)C1

6.000000

6.000000

6.379872

1.014355

4

compound_284

CN1Cc2ccccc2C(F)(c2ccc3sccc3c2)C1

6.587660

6.587660

6.949374

0.964244

Now we’ll run predictions on the same dataset with the production model:

prod_pred_df = pfm.predict_from_model_file(model_path = production_model.params.model_tarball_path,
                                      input_df = test_data,
                                      id_col = id_col ,
                                      smiles_col = smiles_col,
                                      response_col = response_col,
                                      is_featurized=False)

# show most important columns
prod_pred_df[['compound_id', 'base_rdkit_smiles', 'avg_pKi', 'avg_pKi_actual', 'avg_pKi_pred', 'avg_pKi_std']].head()
Standardizing SMILES strings for 533 compounds.

compound_id

base_rdkit_smiles

avg_pKi

avg_pKi_actual

avg_pKi_pred

avg_pKi_std

0

compound_346

OC(C[NH2+]C1CCC1)C1(c2ccc(Cl)c(Cl)c2)CCC1

7.958607

7.958607

7.273634

0.846191

1

compound_225

CN1Cc2ccccc2C(C)(c2ccc3[nH]ncc3c2)C1

6.587660

6.587660

6.546206

1.096473

2

compound_166

O=C(O)C(/C=C/c1ccccc1)C1CCN(CCOC(c2ccccc2)c2cc…

5.430275

5.430275

7.063624

0.735218

3

compound_310

CN1Cc2cc(-c3cccnn3)ccc2C(C)(c2cc3ccccc3[nH]2)C1

6.000000

6.000000

6.230782

1.045009

4

compound_284

CN1Cc2ccccc2C(F)(c2ccc3sccc3c2)C1

6.587660

6.587660

6.676182

1.178339

To compare the performance of the production model with the original best model, we’ll compute the \(R^2\) scores for the predictions from each model and then plot the predicted vs actual values:

best_r2 = np.round(r2_score(best_pred_df.avg_pKi_actual.values, best_pred_df.avg_pKi_pred.values), 6)
prod_r2 = np.round(r2_score(prod_pred_df.avg_pKi_actual.values, prod_pred_df.avg_pKi_pred.values), 6)
print("Best model r2_score: " + str(best_r2))
print("Production model r2_score: " + str(prod_r2))
Best model r2_score: 0.156877
Production model r2_score: 0.271881
fig, ax = plt.subplots(1,2, figsize=(12,6))
pp.plot_pred_vs_actual_from_df(best_pred_df, actual_col='avg_pKi_actual', pred_col='avg_pKi_pred',
    label=f"Best model, $R^2$ = {best_r2:.3f}", ax=ax[0])
pp.plot_pred_vs_actual_from_df(prod_pred_df, actual_col='avg_pKi_actual', pred_col='avg_pKi_pred',
    label=f"Production model, $R^2$ = {prod_r2:.3f}", ax=ax[1])
fig.tight_layout(pad=3.0)
fig.show()
../_images/07_train_production_model_16_1.png

Although neither model has a great \(R^2\) score, the production model does perform better, with \(R^2\) = 0.267 vs 0.157 for the original best model. Also, the points in the production model plot are slightly more concentrated along the diagonal. A possible explanation for the mediocre performance is that the external dataset compounds were filtered so that none have Tanimoto distance < 0.4 to any compound in the original model dataset, so that the test set compounds are outside of the applicability domain of both models. We expect that the models’ performance would improve on a dataset filtered with a smaller Tanimoto distance threshold.

Developing models that generalize well to diverse sets of compounds (i.e., that have a broader applicability domain) is one of the major challenges in machine learning for chemistry. Training a production model is one approach to this problem. To do better we may need to explore other model types or methods of featurizing molecules, with additional rounds of hyperparameter optimization.

Other Functions With Production Parameters

A boolean production parameter is available in these other functions in the AMPL model_retrain module. If production is set to True, the model will be trained in production mode, using the entire dataset for training. Note that for neural network models, the model will be trained for the number of epochs corresponding to the best epoch from the original model training run.

In Tutorial 8, “Visualizations of Model Performances”, we’ll explore a wide range of methods for visualizing and evaluating the performance of AMPL models.

If you have specific feedback about a tutorial, please complete the AMPL Tutorial Evaluation.