01 Data Curation¶
Published: June, 2024, ATOM DDM Team
Please check out the companion tutorial video: ![]()
To train a machine learning model from data, that data must first be “curated” to ensure that chemical structures and properties are represented consistently. Curating raw data is a long, detailed process that takes several steps. SMILES strings need to be standardized, measurements need to be converted to common units, outliers need to be removed or corrected, and replicates need to be combined. These steps are vital to create datasets that can be used to train useful predictive models. Here we will cover some functions in AMPL that will help you to perform these steps.
These are just a few of the steps needed to curate a dataset.
Import Standard Data Science Packages¶
To use AMPL, or to do almost anything else with data, you’ll need to become familiar with the popular packages pandas, numpy, matplotlib and seaborn. When you installed AMPL you will have installed these packages as well, so you simply need to import them here.
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
Read the Data¶
We’ve prepared an example dataset containing Ki values for inhibitors of the SLC6A3 dopamine transporter collected from ChEMBL. This dataset is simpler than most that we find in the wild, but it will let us concisely demonstrate some AMPL curation tools. The first step of data curation is to read the raw data into a Pandas data frame.
# Read in data
raw_df = pd.read_csv('dataset/SLC6A3_Ki.csv')
# Check the number of rows and columns in the dataset
raw_df.shape
(2236, 6)
# List the column names
raw_df.columns.values
array(['molecule_chembl_id', 'smiles', 'standard_type',
'standard_relation', 'standard_value', 'standard_units'],
dtype=object)
This dataset is drawn from the ChEMBL database and contains the following columns:
Column |
Description |
|---|---|
molecule_chembl_id |
The ChEMBL ID for the molecule. |
smiles |
The SMILES string that represents the molecule’s structure. This is the main input used to derive features for AMPL models. |
standard_type |
The type of measurement, e.g., \(IC_{50}\), \(K_i\), \(K_d\), etc. This dataset only contains \(K_i\) data points. |
standard_relation |
The relational operator for a measurement reported as “< \(X\)” or “> \(X\)”, indicating the true value is below or above some limit \(X\) (e.g., the lowest or highest concentration tested). When this occurs we say the measurement is “left-” or “right-censored”. |
standard_value |
The measured value (or the limit value for a censored measurement). |
standard_units |
The units of the measurement. \(K_i\) values may be recorded in different units which will need to be converted to a common unit. The SLC6A3 dataset contains a mixture of nanomolar (nM) and micromolar (µM) units. |
Standardize SMILES¶
The SMILES
grammar allows the same chemical structure to be represented by many
different SMILES
strings. In addition, measurements may be performed on compounds with
different salt groups or with radioisotope labels, which we treat as
equivalent to the base compounds.
AMPL provides a
SMILES
standardization function, base_smiles_from_smiles, that removes salt
groups and isotopes and returns a unique
SMILES
string for each base compound structure. This step simplifies the
machine learning problem by ensuring each compound is represented with
the same set of features and multiple measurements on the same compound
can be grouped together.
Note
The input to “base_smiles_from_smiles” must be a list; numpy arrays and pandas Series objects must be converted with the tolist function.
from atomsci.ddm.utils.struct_utils import base_smiles_from_smiles
# Since the base_smiles_from_smiles function can be slow, we specify the workers=8 argument
# to divide the work across 8 threads.
raw_df['base_rdkit_smiles'] = base_smiles_from_smiles(raw_df.smiles.tolist(), workers=8)
raw_df.smiles.nunique(), raw_df.base_rdkit_smiles.nunique()
(1830, 1823)
For this dataset there are 1830 unique
SMILES
that are standardized to 1823 unique base
SMILES.
It is common for two different
SMILES
strings to be standardized to the same value. From now on we will use
base_rdkit_smiles to represent compound structures.
Calculate \(pK_i\)’s¶
A \(K_i\) is an equilibrium constant for the reaction of an inhibitor with a target protein; it is measured in concentration units. Like many other chemical properties, \(K_i\) values may span several orders of magnitude, from picomolar to millimolar (a billion-fold range). This makes it difficult to fit machine learning models to them because the variance of repeat measurements grows with the measured value, as illustrated in the left hand plot below. We prefer instead to work with \(pK_i\) values, where \(pK_i = -\mathrm{log}_{10} (K_i)\) with \(K_i\) in molar units, because the log transformed measurements have more stable variances, as shown at right. Similar transformations are often applied to properties like \(IC_{50}\)’s, \(K_d\)’s and \(EC_{50}\)’s, yielding \(pIC_{50}\)’s, \(pK_d\)’s, and \(pEC_{50}\)’s.
Note
For those who want more details: It’s hard to fit machine learning (ML) models to raw Ki’s because typical training methods seek to minimize a squared-error loss function (the error being the difference between the actual and predicted values). Squared errors tend to scale with the variance among replicates, so the loss function is dominated by the compounds with the largest variance, i.e. those with the largest Ki’s. This leads to models that perform OK on the least potent compounds and terribly on the most potent.
The AMPL function
compute_negative_log_responses performs these variance stabilizing
transformations, converting \(K_i\)’s to \(pK_i\)’s and so on.
The code below uses the units in the standard_units column and the
conversion functions specified in the unit_conv argument to convert
the \(K_i\)’s in the standard_value column to molar units before
applying the log transformation. It also inverts the “\(<\)” and
“\(>\)” operators in relation_col so that they correctly
describe the \(pK_i\) values, which decrease as \(K_i\) values
increase (e.g., “\(K_i > 100 \mathrm{µ}M\)” means
“\(K_i > 10^{-4} \mathrm{M}\)” which implies “\(pK_i < 4\)”).
from atomsci.ddm.utils.data_curation_functions import compute_negative_log_responses
raw_df = compute_negative_log_responses(raw_df,
unit_col='standard_units',
value_col='standard_value',
new_value_col='pKi',
relation_col='standard_relation',
unit_conv={'µM':lambda x: x*1e-6, 'nM':lambda x: x*1e-9},
inplace=False)
We then plot histograms to compare the distributions of the raw and transformed \(K_i\)’s:
_ = raw_df[['standard_value', 'pKi']].hist()
Standardize Relations¶
Some databases may contain measurements reported with a variety of
relational operators such as “\(>=\)”, “\(<=\)”, “\(~\)” and
so on. In datasets used to train models,
AMPL expects the
relation column to contain one of the three standard operators
“\(>\)”, “\(<\)” or “\(=\)”, or an empty field representing
equality. AMPL
provides a standardize_relations function to coerce nonstandard
relations to one of the standard values. We use the rel_col and
output_rel_col arguments to indicate that the input relations are in
the standard_relation column, and to specify a new column to receive
the standardized relations. The db=ChEMBL argument tells the
function to apply ChEMBL-specific formatting changes (such as removing
quotes around operators).
from atomsci.ddm.utils.data_curation_functions import standardize_relations
raw_df = standardize_relations(raw_df,
rel_col='standard_relation', db='ChEMBL',
output_rel_col='fixed_relation')
# Look at the operator counts before and after standardization
raw_df.standard_relation.value_counts()
standard_relation
'=' 1868
'<' 319
= 39
'>' 8
'<=' 2
Name: count, dtype: int64
raw_df.fixed_relation.value_counts()
fixed_relation
= 1907
< 321
> 8
Name: count, dtype: int64
For this dataset, we see that the nonstandard operator “\(<=\)” was changed to “\(<\)”, and the single quotes around some operators were removed, as we requested.
Remove Outliers and Aggregate Replicate Measurements¶
The final step is to remove outliers and aggregate (average) replicate
measurements on the same compounds. The function
remove_outlier_replicates is a simple filter that groups
measurements by compound, computes the median of each group, and removes
values that differ more than max_diff_from_median units from the
median. When the measurements are very spread out relative to
max_diff_from_median, all the rows for a compound may be deleted
from the dataset. The default setting (\(1.0\)) generally works well
for \(pK_i\) values.
The function aggregate_assay_data replaces multiple replicate
measurements for each compound with a single aggregate value. Usually
this is simply the average over the replicates, but if the dataset
contains both censored and uncensored values for a compound, the
function computes a maximum likelihood estimate that takes the censoring
into account.
from atomsci.ddm.utils.curate_data import remove_outlier_replicates, aggregate_assay_data
curated_df = remove_outlier_replicates(raw_df, id_col='molecule_chembl_id',
response_col='pKi',
max_diff_from_median=1.0)
curated_df = aggregate_assay_data(curated_df,
value_col='pKi',
output_value_col='avg_pKi',
id_col='molecule_chembl_id',
smiles_col='base_rdkit_smiles',
relation_col='fixed_relation',
label_actives=False,
verbose=True
)
print("Original data shape: ", raw_df.shape)
print("Curated data shape: ", curated_df.shape)
curated_df.head()
Removed 17 pKi replicate measurements that were > 1.0 from median
9 entries in input table are missing SMILES strings
1819 unique SMILES strings are reduced to 1819 unique base SMILES strings
Original data shape: (2236, 9)
Curated data shape: (1819, 4)
compound_id |
base_rdkit_smiles |
relation |
avg_pKi |
|
|---|---|---|---|---|
0 |
CHEMBL2113217 |
C#CCC(C(=O)c1ccc(C)cc1)N1CCCC1 |
5.636388 |
|
1 |
CHEMBL220765 |
C#CCN1CC[C@@H](Cc2ccc(F)cc2)C[C@@H]1CCCNC(=O)N… |
6.206908 |
|
2 |
CHEMBL1945248 |
C#CCN1[C@H]2CC[C@@H]1[C@@H](C(=O)OC)[C@@H](c1c… |
7.849858 |
|
3 |
CHEMBL1479 |
C#C[C@]1(O)CC[C@H]2[C@@H]3CCC4=Cc5oncc5C[C@]4(… |
5.264721 |
|
4 |
CHEMBL691 |
C#C[C@]1(O)CC[C@H]2[C@@H]3CCc4cc(O)ccc4[C@H]3C… |
6.352617 |
The data frame returned by aggregate_assay_data contains only four
columns:
Column |
Description |
|---|---|
compound_id |
a unique ID for each base SMILES string. When multiple values are found in id_col for the same SMILES string, the function assigns it the first one in lexicographic order. |
base_rdkit_smiles |
the standardized SMILES string. |
relation |
an aggregate relation for the set of replicates |
avg_pKi |
or whatever you specified in the output_value_col argument, containing the aggregate/average \(pK_i\) value. |
Note
When the “label_actives” argument is True (the default), an additional column “active” is added for use in training classification models. We will cover classification models in a future tutorial.
Finally, we save the curated dataset to a CSV file.
curated_df.to_csv('dataset/SLC6A3_Ki_curated.csv', index=False)
In Tutorial 2, “Splitting Datasets for Validation and Testing”, we’ll show how to split this dataset into training, validation and test sets for model training.
If you have specific feedback about a tutorial, please complete the AMPL Tutorial Evaluation.