################ 01 Data Curation ################ *Published: June, 2024, ATOM DDM Team* Please check out the companion tutorial video: |youtube-image| .. |youtube-image| image:: ../_static/img/youtube_icon.png :alt: Tutorial 01 Data Curation :target: https://www.youtube.com/watch?v=a-uRfjF8izs ------------ To train a machine learning model from data, that data must first be "curated" to ensure that chemical structures and properties are represented consistently. Curating raw data is a long, detailed process that takes several steps. `SMILES `_ strings need to be standardized, measurements need to be converted to common units, outliers need to be removed or corrected, and replicates need to be combined. These steps are vital to create datasets that can be used to train useful predictive models. Here we will cover some functions in `AMPL `_ that will help you to perform these steps. - `base_smiles_from_smiles `_ - `standardize_relations `_ - `compute_negative_log_responses `_ - `remove_outlier_replicates `_ - `aggregate_assay_data `_ These are just a few of the steps needed to curate a dataset. Import Standard Data Science Packages ************************************* To use `AMPL `_, or to do almost anything else with data, you'll need to become familiar with the popular packages `pandas `_, `numpy `_, `matplotlib `_ and `seaborn `_. When you installed `AMPL `_ you will have installed these packages as well, so you simply need to import them here. .. code:: ipython3 import os import pandas as pd import numpy as np import matplotlib.pyplot as plt import seaborn as sns Read the Data ************* We've prepared an example dataset containing `Ki `_ values for inhibitors of the `SLC6A3 `_ dopamine transporter collected from `ChEMBL `_. This dataset is simpler than most that we find in the wild, but it will let us concisely demonstrate some `AMPL `_ curation tools. The first step of data curation is to read the raw data into a Pandas data frame. .. code:: ipython3 # Read in data raw_df = pd.read_csv('dataset/SLC6A3_Ki.csv') .. code:: ipython3 # Check the number of rows and columns in the dataset raw_df.shape .. parsed-literal:: (2236, 6) .. code:: ipython3 # List the column names raw_df.columns.values .. parsed-literal:: array(['molecule_chembl_id', 'smiles', 'standard_type', 'standard_relation', 'standard_value', 'standard_units'], dtype=object) This dataset is drawn from the `ChEMBL `_ database and contains the following columns: .. list-table:: :header-rows: 1 :class: tight-table * - Column - Description * - `molecule_chembl_id` - The ChEMBL ID for the molecule. * - `smiles` - The `SMILES `_ string that represents the molecule's structure. This is the main input used to derive features for `AMPL `_ models. * - `standard_type` - The type of measurement, e.g., :math:`IC_{50}`, :math:`K_i`, :math:`K_d`, etc. This dataset only contains :math:`K_i` data points. * - `standard_relation` - The relational operator for a measurement reported as "< :math:`X`" or "> :math:`X`", indicating the true value is below or above some limit :math:`X` (e.g., the lowest or highest concentration tested). When this occurs we say the measurement is "left-" or "right-censored". * - `standard_value` - The measured value (or the limit value for a censored measurement). * - `standard_units` - The units of the measurement. :math:`K_i` values may be recorded in different units which will need to be converted to a common unit. The `SLC6A3 `_ dataset contains a mixture of nanomolar (nM) and micromolar (µM) units. Standardize SMILES ****************** The `SMILES `_ grammar allows the same chemical structure to be represented by many different `SMILES `_ strings. In addition, measurements may be performed on compounds with different salt groups or with radioisotope labels, which we treat as equivalent to the base compounds. `AMPL `_ provides a `SMILES `_ standardization function, ``base_smiles_from_smiles``, that removes salt groups and isotopes and returns a unique `SMILES `_ string for each base compound structure. This step simplifies the machine learning problem by ensuring each compound is represented with the same set of features and multiple measurements on the same compound can be grouped together. .. note:: *The input to "base_smiles_from_smiles" must be a list; numpy arrays and pandas Series objects must be converted with the tolist function.* .. code:: ipython3 from atomsci.ddm.utils.struct_utils import base_smiles_from_smiles # Since the base_smiles_from_smiles function can be slow, we specify the workers=8 argument # to divide the work across 8 threads. raw_df['base_rdkit_smiles'] = base_smiles_from_smiles(raw_df.smiles.tolist(), workers=8) .. code:: ipython3 raw_df.smiles.nunique(), raw_df.base_rdkit_smiles.nunique() .. parsed-literal:: (1830, 1823) For this dataset there are 1830 unique `SMILES `_ that are standardized to 1823 unique base `SMILES `_. It is common for two different `SMILES `_ strings to be standardized to the same value. From now on we will use ``base_rdkit_smiles`` to represent compound structures. Calculate :math:`pK_i`'s ------------------------ A :math:`K_i` is an equilibrium constant for the reaction of an inhibitor with a target protein; it is measured in concentration units. Like many other chemical properties, :math:`K_i` values may span several orders of magnitude, from picomolar to millimolar (a billion-fold range). This makes it difficult to fit machine learning models to them because the variance of repeat measurements grows with the measured value, as illustrated in the left hand plot below. We prefer instead to work with :math:`pK_i` values, where :math:`pK_i = -\mathrm{log}_{10} (K_i)` with :math:`K_i` in molar units, because the log transformed measurements have more stable variances, as shown at right. Similar transformations are often applied to properties like :math:`IC_{50}`'s, :math:`K_d`'s and :math:`EC_{50}`'s, yielding :math:`pIC_{50}`'s, :math:`pK_d`'s, and :math:`pEC_{50}`'s. .. image:: ../_static/img/01_data_curation_files/01_data_curation_pki_mean.png .. note:: *For those who want more details: It's hard to fit machine learning (ML) models to raw Ki's because typical training methods seek to minimize a squared-error loss function (the error being the difference between the actual and predicted values). Squared errors tend to scale with the variance among replicates, so the loss function is dominated by the compounds with the largest variance, i.e. those with the largest Ki's. This leads to models that perform OK on the least potent compounds and terribly on the most potent.* The `AMPL `_ function ``compute_negative_log_responses`` performs these variance stabilizing transformations, converting :math:`K_i`'s to :math:`pK_i`'s and so on. The code below uses the units in the ``standard_units`` column and the conversion functions specified in the ``unit_conv`` argument to convert the :math:`K_i`'s in the ``standard_value`` column to molar units before applying the log transformation. It also inverts the ":math:`<`" and ":math:`>`" operators in ``relation_col`` so that they correctly describe the :math:`pK_i` values, which *decrease* as :math:`K_i` values *increase* (e.g., ":math:`K_i > 100 \mathrm{µ}M`" means ":math:`K_i > 10^{-4} \mathrm{M}`" which implies ":math:`pK_i < 4`"). .. code:: ipython3 from atomsci.ddm.utils.data_curation_functions import compute_negative_log_responses raw_df = compute_negative_log_responses(raw_df, unit_col='standard_units', value_col='standard_value', new_value_col='pKi', relation_col='standard_relation', unit_conv={'µM':lambda x: x*1e-6, 'nM':lambda x: x*1e-9}, inplace=False) We then plot histograms to compare the distributions of the raw and transformed :math:`K_i`'s: .. code:: ipython3 _ = raw_df[['standard_value', 'pKi']].hist() .. image:: ../_static/img/01_data_curation_files/01_data_curation_18_0.png Standardize Relations ********************* Some databases may contain measurements reported with a variety of relational operators such as ":math:`>=`", ":math:`<=`", ":math:`~`" and so on. In datasets used to train models, `AMPL `_ expects the relation column to contain one of the three standard operators ":math:`>`", ":math:`<`" or ":math:`=`", or an empty field representing equality. `AMPL `_ provides a ``standardize_relations`` function to coerce nonstandard relations to one of the standard values. We use the ``rel_col`` and ``output_rel_col`` arguments to indicate that the input relations are in the ``standard_relation`` column, and to specify a new column to receive the standardized relations. The ``db=ChEMBL`` argument tells the function to apply ChEMBL-specific formatting changes (such as removing quotes around operators). .. code:: ipython3 from atomsci.ddm.utils.data_curation_functions import standardize_relations raw_df = standardize_relations(raw_df, rel_col='standard_relation', db='ChEMBL', output_rel_col='fixed_relation') .. code:: ipython3 # Look at the operator counts before and after standardization raw_df.standard_relation.value_counts() .. parsed-literal:: standard_relation '=' 1868 '<' 319 = 39 '>' 8 '<=' 2 Name: count, dtype: int64 .. code:: ipython3 raw_df.fixed_relation.value_counts() .. parsed-literal:: fixed_relation = 1907 < 321 > 8 Name: count, dtype: int64 For this dataset, we see that the nonstandard operator ":math:`<=`" was changed to ":math:`<`", and the single quotes around some operators were removed, as we requested. Remove Outliers and Aggregate Replicate Measurements **************************************************** The final step is to remove outliers and aggregate (average) replicate measurements on the same compounds. The function ``remove_outlier_replicates`` is a simple filter that groups measurements by compound, computes the median of each group, and removes values that differ more than ``max_diff_from_median`` units from the median. When the measurements are very spread out relative to ``max_diff_from_median``, *all* the rows for a compound may be deleted from the dataset. The default setting (:math:`1.0`) generally works well for :math:`pK_i` values. The function ``aggregate_assay_data`` replaces multiple replicate measurements for each compound with a single aggregate value. Usually this is simply the average over the replicates, but if the dataset contains both censored and uncensored values for a compound, the function computes a maximum likelihood estimate that takes the censoring into account. .. code:: ipython3 from atomsci.ddm.utils.curate_data import remove_outlier_replicates, aggregate_assay_data curated_df = remove_outlier_replicates(raw_df, id_col='molecule_chembl_id', response_col='pKi', max_diff_from_median=1.0) curated_df = aggregate_assay_data(curated_df, value_col='pKi', output_value_col='avg_pKi', id_col='molecule_chembl_id', smiles_col='base_rdkit_smiles', relation_col='fixed_relation', label_actives=False, verbose=True ) print("Original data shape: ", raw_df.shape) print("Curated data shape: ", curated_df.shape) curated_df.head() .. parsed-literal:: Removed 17 pKi replicate measurements that were > 1.0 from median 9 entries in input table are missing SMILES strings 1819 unique SMILES strings are reduced to 1819 unique base SMILES strings Original data shape: (2236, 9) Curated data shape: (1819, 4) .. list-table:: :header-rows: 1 :class: tight-table * - - compound_id - base_rdkit_smiles - relation - avg_pKi * - 0 - CHEMBL2113217 - C#CCC(C(=O)c1ccc(C)cc1)N1CCCC1 - - 5.636388 * - 1 - CHEMBL220765 - C#CCN1CC[C@@H](Cc2ccc(F)cc2)C[C@@H]1CCCNC(=O)N... - - 6.206908 * - 2 - CHEMBL1945248 - C#CCN1[C@H]2CC[C@@H]1[C@@H](C(=O)OC)[C@@H](c1c... - - 7.849858 * - 3 - CHEMBL1479 - C#C[C@]1(O)CC[C@H]2[C@@H]3CCC4=Cc5oncc5C[C@]4(... - - 5.264721 * - 4 - CHEMBL691 - C#C[C@]1(O)CC[C@H]2[C@@H]3CCc4cc(O)ccc4[C@H]3C... - - 6.352617 The data frame returned by ``aggregate_assay_data`` contains only four columns: .. list-table:: :header-rows: 1 :class: tight-table * - Column - Description * - `compound_id` - a unique ID for each base `SMILES `_ string. When multiple values are found in id_col for the same `SMILES `_ string, the function assigns it the first one in lexicographic order. * - `base_rdkit_smiles` - the standardized `SMILES `_ string. * - `relation` - an aggregate relation for the set of replicates * - `avg_pKi` - or whatever you specified in the output_value_col argument, containing the aggregate/average :math:`pK_i` value. .. note:: *When the "label_actives" argument is True (the default), an additional column "active" is added for use in training classification models. We will cover classification models in a future tutorial*. Finally, we save the curated dataset to a CSV file. .. code:: ipython3 curated_df.to_csv('dataset/SLC6A3_Ki_curated.csv', index=False) In **Tutorial 2, "Splitting Datasets for Validation and Testing"**, we'll show how to split this dataset into training, validation and test sets for model training. If you have specific feedback about a tutorial, please complete the `AMPL Tutorial Evaluation `_.