################################################ 02 Splitting Datasets for Validation and Testing ################################################ *Published: June, 2024, ATOM DDM Team* Please check out the companion tutorial video: |youtube-image| .. |youtube-image| image:: ../_static/img/youtube_icon.png :alt: Tutorial 02 Perform Split :target: https://www.youtube.com/watch?v=gsa2xfG3OSE ------------ A common problem with machine learning (ML) models is that they can very easily "overfit" the training data. This means that the model predicts the response values for training set compounds with perfect accuracy, but fails miserably on molecules that differ from the training compounds. To avoid overfitting and provide a way to test a model's ability to generalize to new molecules, ML researchers have developed a variety of data splitting and training schemes. `AMPL `_ supports two of the most popular strategies: - 3-way training/validation/test splits - *k*-fold cross-validation In this tutorial we will perform a 3-way split of the curated dataset we prepared in **Tutorial 1, "Data Curation"**, using the `AMPL `_ modules, classes and functions listed below. *k*-fold cross-validation will be addressed in a future tutorial. - `parameter_parser `_ - `ModelPipeline `_ - `split_dataset `_ - `compare_splits_plots `_ - `split_response_dist_plots `_ With 3-way data splitting, you divide your curated dataset into three subsets: .. list-table:: :header-rows: 1 :class: tight-table * - Subset - Description * - **Training set** - Usually the largest subset. `AMPL `_ feeds the training set compound features and response values in batches to the model fitting algorithm. The fitting algorithm iteratively adjusts the model parameters after each batch so that the predicted responses are close (on average) to the actual response values. * - **Validation set** - Used after training a collection of models to see how well each one performs on "new" compounds that weren't used directly to fit the model parameters, so you can choose the best model. The validation set is also used by AMPL during **neural network model** training to implement "early stopping", a trick to avoid overfitting the training set. * - **Test set** - After training is completed, `AMPL `_ scores the predictions on the test set compounds to provide a measure of the final model's performance. .. image:: ../_static/img/02_perform_a_split_files/02_split_example_figure.png .. code:: ipython3 import pandas as pd # Set up dataset_file = 'dataset/SLC6A3_Ki_curated.csv' odir = 'dataset' # Set for less chatty log messages import logging logger = logging.getLogger('ATOM') logger.setLevel(logging.INFO) Splitting Methods ***************** `AMPL `_ supports a variety of splitting algorithms, including **random** and **scaffold splits**. A **scaffold** is the core structure of a molecule, with its side chains removed. **Scaffold splits** assign molecules to the training, validation and test sets, so that molecules with the same scaffold group together in the same subset. This ensures that compounds in the validation and test sets have different scaffolds from those in the training set, and are thus more likely to be structurally different. By contrast, a random split assigns molecules to subsets randomly. Rationale for Using Scaffold vs Random Splits ============================================= A **scaffold split** is more challenging for model fitting than a **random split**. With a random split, many test set compounds may be similar to molecules in the training set, so a model may *appear* to perform well when it is simply "memorizing" training compound structures associated with different response levels. Such a model will perform badly on molecules that truly are different from the training compounds. However, a model trained on molecules that belong to a limited set of scaffold classes has to learn combinations of features that generalize across many chemical families to make accurate predictions on compounds with novel scaffolds. A scaffold split provides a way to select models with greater generalization ability and assess their performance realistically. Performing a Split ****************** We start by constructing a dictionary of parameter values: .. code:: ipython3 params = { # dataset info "dataset_key" : dataset_file, "response_cols" : "avg_pKi", "id_col": "compound_id", "smiles_col" : "base_rdkit_smiles", "result_dir": odir, # splitting "previously_split": "False", "splitter": 'scaffold', "split_valid_frac": "0.15", "split_test_frac": "0.15", # featurization & training params "featurizer": "computed_descriptors", "descriptor_type" : "rdkit_raw", "previously_featurized": "True", } We parse the ``params`` dict with the ``parameter_parser`` module to create a parameter object for input to `AMPL `_ functions. We then create a ``ModelPipeline`` object and call its ``split_dataset`` method to do the actual split. .. note:: *"split_dataset()" can also featurize the dataset; we will explore featurization in a later tutorial. For now, we provide prefeaturized data in the "./dataset/scaled_descriptors" folder* .. code:: ipython3 from atomsci.ddm.pipeline import model_pipeline as mp from atomsci.ddm.pipeline import parameter_parser as parse pparams = parse.wrapper(params) MP = mp.ModelPipeline(pparams) split_uuid = MP.split_dataset() The dataset split table is saved as a .csv in the same directory as the ``dataset_key``. The name of the split file starts with the ``dataset_key`` and is followed by the ``split_strategy`` (train\_valid\_test), ``split type`` (scaffold), and the ``split_uuid`` (a unique identifier of the split). .. code:: ipython3 # display the split file location import glob import os dirname = os.path.dirname(params['dataset_key']) split_file = glob.glob(f"{dirname}/*{split_uuid}*")[0] split_file .. parsed-literal:: 'dataset/SLC6A3_Ki_curated_train_valid_test_scaffold_640d807b-f58a-47f3-913d-4a60db0a9dbd.csv' Format of the Split File ************************ The split file consists of three columns: ``cmpd_id`` is the compound ID; ``subset`` tells you if the compound is in the train, validation, or test set and ``fold`` contains the fold index, which is used only by *k*-fold cross-validation splits. .. code:: ipython3 # Explore contents of the split file split_df = pd.read_csv(split_file) split_df.head() .. list-table:: :header-rows: 1 :class: tight-table * - - cmpd_id - subset - fold * - 0 - CHEMBL498564 - train - 0 * - 1 - CHEMBL1085567 - train - 0 * - 2 - CHEMBL236473 - train - 0 * - 3 - CHEMBL464422 - train - 0 * - 4 - CHEMBL611677 - train - 0 .. code:: ipython3 # Show the numbers of compounds in each split subset split_df.subset.value_counts() .. parsed-literal:: subset train 1273 valid 273 test 273 Name: count, dtype: int64 Visualizing Scaffold Splits *************************** `Tanimoto distance `_ is a handy way to measure structural dissimilarity between compounds represented using `ECFP fingerprints `_. We can use functions in the ``compare_splits_plots`` module to compute `Tanimoto distance `_ between each validation and test set compound and its nearest neighbor in the training set, and then plot the distribution of distances for each subset. .. code:: ipython3 import seaborn as sns import matplotlib.pyplot as plt import atomsci.ddm.utils.compare_splits_plots as csp # read the dataset df = pd.read_csv('dataset/SLC6A3_Ki_curated.csv') # read the split file split = pd.read_csv(split_file) split_type = params['splitter'] # create SplitStats ss = csp.SplitStats(df, split, smiles_col='base_rdkit_smiles', id_col='compound_id', response_cols=['avg_pKi']) # plot fig, ax = plt.subplots(1,2, sharey=True, figsize=(10,5)) ss.dist_hist_train_v_valid_plot(ax=ax[0]) ax[0].set_title(f"Train vs Valid Tanimoto Dist using {split_type} split") ss.dist_hist_train_v_test_plot(ax=ax[1]) ax[1].set_title(f"Train vs Test Tanimoto Dist using {split_type} split"); .. image:: ../_static/img/02_perform_a_split_files/02_perform_a_split_14_0.png The majority of compounds have `Tanimoto distances `_ between 0.2 and 0.8 from the training set, indicating that they are structurally different from the training compounds. The distance distributions are similar between the test and validation sets. This indicates that a model selected based on its validation set performance will likely have similar performance when evaluated on the test set. We can also plot the distributions of the response values - the :math:`pK_i`'s - in each subset. These plots can be useful in diagnosing model performance problems; if the response distributions in the training and test sets are dramatically different, it may be hard to train a model that performs well on the test set. .. code:: ipython3 import atomsci.ddm.utils.split_response_dist_plots as srdp split_params = { "dataset_key" : dataset_file, "smiles_col" : "base_rdkit_smiles", "response_cols" : "avg_pKi", "split_uuid": split_uuid, "splitter": 'scaffold', } srdp.plot_split_subset_response_distrs(split_params) .. image:: ../_static/img/02_perform_a_split_files/02_perform_a_split_17_0.png For this dataset, the :math:`pK_i`'s have roughly similar distributions across the **scaffold split** subsets, except that the training set has slightly more compounds with large values. In  **Tutorial 3, "Train a Simple Regression Model"**, we will use this dataset and **scaffold split** to train a model to predict the :math:`pK_i`'s. If you have specific feedback about a tutorial, please complete the `AMPL Tutorial Evaluation `_.