pipeline package

Submodules

pipeline.ave_splitter module

Code to split a DeepChem dataset in a way that minimizes the AVE bias, as described in the paper by Wallach & Heifets, "Most Ligand-Based Classification Benchmarks Reward Memorization Rather than Generalization" (J. Chem. Inf. Model., 2018).

Although the AVEMinSplitter class and its methods are public, you will typically not call them directly. Instead, they are invoked by setting the splitter parameter to 'ave_min' in the model parameters when you train a model.
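For example, a minimal sketch of a model parameter dictionary that selects this splitter (the dataset path and the split_valid_frac name are illustrative; consult the AMPL parameter documentation for the authoritative names):

>>> params = {
...     'dataset_key': 'my_dataset.csv',   # illustrative path
...     'splitter': 'ave_min',             # invokes AVEMinSplitter internally
...     'split_valid_frac': 0.2,
... }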

class pipeline.ave_splitter.AVEMinSplitter(metric='jaccard', verbose=True, num_workers=1, max_iter=300, ndist=100, debug_mode=False)[source]

Bases: deepchem.splits.splitters.Splitter

Class for splitting a DeepChem dataset in order to minimize the Asymmetric Validation Embedding bias.

Uses distances between feature vectors and binary classifications to compute the AVE bias for a candidate split and find a split that minimizes the bias.

Attributes:

metric (str): Name of the metric to be used to compute distances between feature vectors.

verbose (bool): Ignored.

num_workers (int): Number of threads to use to parallelize computations.

max_iter (int): Maximum number of iterations to execute while trying to minimize the bias.

ndist (int): Number of points to use to approximate CDF of distance distribution.

debug_mode (bool): If true, generate extra plots and log messages for debugging.

split(dataset, frac_train=0.8, frac_valid=0.2, frac_test=0.0, seed=None, log_every_n=None)[source]

Split dataset into training and validation sets that minimize the AVE bias. A test set is not generated; to do a 3-way split, call this function twice.

Args:

dataset (dc.Dataset): The DeepChem dataset to be split

frac_train (float): The approximate fraction of compounds to put in the training set

frac_valid (float): The approximate fraction of compounds to put in the validation or test set

frac_test (float): Ignored; included only for compatibility with the DeepChem Splitter API

seed (int): Ignored

log_every_n (int or None): Ignored

Returns:

tuple: Lists of indices of compounds assigned to the training and validation/test sets.

The third element of the tuple is an empty list, because this function only does a 2-way split.
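If you do need to call the splitter directly, a minimal sketch (assuming dataset is an already-featurized DeepChem classification dataset) looks like this:

>>> from pipeline.ave_splitter import AVEMinSplitter
>>> splitter = AVEMinSplitter(metric='jaccard')
>>> train_inds, valid_inds, test_inds = splitter.split(dataset, frac_train=0.8, frac_valid=0.2)
>>> test_inds   # always empty; this method only does a 2-way split
[]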

Todo:
Change code to do a 3-way split in one call, rather than requiring the distance matrices to be computed twice.
pipeline.ave_splitter.analyze_split(params, id_col='compound_id', smiles_col='rdkit_smiles', active_col='active')[source]

Evaluate the AVE bias for the training/validation and training/test set splits of the given dataset.

Also shows the active frequencies in each subset and for the dataset as a whole. The id_col, smiles_col and active_col arguments are defaults used when these column names aren't found in the dataset metadata; when they are found, the metadata values are used instead.

Args:

params (argparse.Namespace): Pipeline parameters.

id_col (str): Dataset column containing compound IDs.

smiles_col (str): Dataset column containing SMILES strings.

active_col (str): Dataset column containing binary classifications.

Returns:
pandas.DataFrame: Table of split subsets showing sizes, numbers and fractions of active compounds
pipeline.ave_splitter.permutation(x)

Randomly permute a sequence, or return a permuted range.

If x is a multi-dimensional array, it is only shuffled along its first index.

Note

New code should use the permutation method of a default_rng() instance instead; see the NumPy random quick start guide.

Args:
x (int or array_like): If x is an integer, randomly permute np.arange(x). If x is an array, make a copy and shuffle the elements randomly.

Returns:
out (ndarray): Permuted sequence or array range.

See also:
Generator.permutation, which should be used for new code.

>>> np.random.permutation(10)
array([1, 7, 4, 3, 0, 9, 2, 5, 8, 6]) # random
>>> np.random.permutation([1, 4, 9, 12, 15])
array([15,  1,  9,  4, 12]) # random
>>> arr = np.arange(9).reshape((3, 3))
>>> np.random.permutation(arr)
array([[6, 7, 8], # random
       [0, 1, 2],
       [3, 4, 5]])
pipeline.ave_splitter.shuffle(x)

Modify a sequence in-place by shuffling its contents.

This function only shuffles the array along the first axis of a multi-dimensional array. The order of sub-arrays is changed but their contents remains the same.

Note

New code should use the shuffle method of a default_rng() instance instead; see the NumPy random quick start guide.

Args:
x (ndarray or MutableSequence): The array, list or mutable sequence to be shuffled.

Returns:
None

See also:
Generator.shuffle, which should be used for new code.

>>> arr = np.arange(10)
>>> np.random.shuffle(arr)
>>> arr
[1 7 5 2 9 4 3 6 0 8] # random

Multi-dimensional arrays are only shuffled along the first axis:

>>> arr = np.arange(9).reshape((3, 3))
>>> np.random.shuffle(arr)
>>> arr
array([[3, 4, 5], # random
       [6, 7, 8],
       [0, 1, 2]])

pipeline.chem_diversity module

Functions to generate matrices or vectors of distances between compounds

pipeline.chem_diversity.calc_dist_diskdataset(feat_type, dist_met, dataset1, dataset2=None, calc_type='nearest', num_nearest=1, **metric_kwargs)[source]

Returns an array of distances, either between all compounds in a single dataset or between two datasets, given as DeepChem Dataset objects.

Args:

feat_type (str): How the data was featurized. Current options are ‘ECFP’ or ‘descriptors’.

dist_met (str): What distance metric to use. Current options include tanimoto, cosine, cityblock, euclidean, or any other metric supported by scipy.spatial.distance.pdist().

dataset1 (deepchem.Dataset): Dataset containing features of compounds to be compared.

dataset2 (deepchem.Dataset, optional): Second dataset, if two datasets are to be compared.

calc_type (str): Type of summarization to perform on rows of distance matrix. See function calc_summary for options.

num_nearest (int): Additional parameter for calc_types nearest, nth_nearest and avg_n_nearest.

metric_kwargs: Additional arguments to be passed to functions that calculate metrics.

Returns:
np.ndarray: Vector or matrix of distances between feature vectors.
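A minimal usage sketch (assuming dataset1 and dataset2 are DeepChem Dataset objects whose X matrices contain ECFP fingerprints):

>>> from pipeline import chem_diversity as cd
>>> nn_dists = cd.calc_dist_diskdataset('ECFP', 'tanimoto', dataset1, dataset2,
...                                     calc_type='nearest', num_nearest=1)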
pipeline.chem_diversity.calc_dist_feat_array(feat_type, dist_met, feat1, feat2=None, calc_type='nearest', num_nearest=1, **metric_kwargs)[source]

Returns a vector or array of distances, either between all compounds in a single dataset or between two datasets, given the feature matrices for the dataset(s).

Args:

feat_type (str): How the data was featurized. Current options are ‘ECFP’ or ‘descriptors’.

dist_met (str): What distance metric to use. Current options include tanimoto, cosine, cityblock, euclidean, or any other metric supported by scipy.spatial.distance.pdist().

feat1: feature matrix as a numpy array

feat2: Optional, second feature matrix

calc_type (str): Type of summarization to perform on rows of distance matrix. See function calc_summary for options.

num_nearest (int): Additional parameter for calc_types nearest, nth_nearest and avg_n_nearest.

metric_kwargs: Additional arguments to be passed to functions that calculate metrics.

Returns:
dists: vector or array of distances
pipeline.chem_diversity.calc_dist_smiles(feat_type, dist_met, smiles_arr1, smiles_arr2=None, calc_type='nearest', num_nearest=1, **metric_kwargs)[source]

Returns an array of distances between compounds given as SMILES strings, either between all pairs of compounds in a single dataset or between two datasets.

Args:

feat_type (str): How the data is to be featurized, if dist_met is not ‘mcs’. The only option supported currently is ‘ECFP’.

dist_met (str): What distance metric to use. Current options include ‘tanimoto’ and ‘mcs’.

smiles_arr1 (list): First list of SMILES strings.

smiles_arr2 (list): Optional, second list of SMILES strings. May contain a single SMILES string, for a compound-to-dataset comparison.

calc_type (str): Type of summarization to perform on rows of distance matrix. See function calc_summary for options.

num_nearest (int): Additional parameter for calc_types nearest, nth_nearest and avg_n_nearest.

metric_kwargs: Additional arguments to be passed to functions that calculate metrics.

Returns:
dists: vector or array of distances
Todo:

Fix the function _get_descriptors(), which is broken, and re-enable the ‘descriptors’ option for feat_type. Will need to add a parameter to indicate what kind of descriptors should be computed.

Allow other metrics for ECFP features, as in calc_dist_diskdataset().

pipeline.chem_diversity.calc_summary(dist_arr, calc_type, num_nearest=1, within_dset=False)[source]

Returns a summary of the distances in dist_arr, depending on calc_type.

Args:

dist_arr (np.ndarray): Either a 2D distance matrix, or a 1D condensed distance matrix (flattened upper triangle).

calc_type (str): The type of summary values to return:

all: The distance matrix itself

nearest: The distances to the num_nearest nearest neighbors of each compound (except compound itself)

nth_nearest: The distance to the num_nearest’th nearest neighbor

avg_n_nearest: The average of the num_nearest nearest neighbor distances

farthest: The distance to the farthest neighbor

avg: The average of all distances for each compound

num_nearest (int): Additional parameter for calc_types nearest, nth_nearest and avg_n_nearest.

within_dset (bool): True if input distances are between compounds in the same dataset.

Returns:

dists (np.array): A numpy array of distances. For calc_type ‘nearest’ with num_nearest > 1, this is a 2D array with a row for each compound; otherwise it is a 1D array.
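For illustration, a condensed distance matrix computed with scipy can be summarized directly; this sketch assumes feats is a 2D numpy feature matrix for a single dataset:

>>> from scipy.spatial.distance import pdist
>>> from pipeline.chem_diversity import calc_summary
>>> cond_dists = pdist(feats, metric='euclidean')
>>> nn_dists = calc_summary(cond_dists, calc_type='nearest', num_nearest=1, within_dset=True)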
pipeline.chem_diversity.upload_distmatrix_to_DS(dist_matrix, feature_type, compound_ids, bucket, title, description, tags, key_values, filepath='./', dataset_key=None)[source]

Uploads a distance matrix to the datastore with the appropriate tags

Args:

dist_matrix (np.ndarray): The distance matrix.

feature_type (str): How the data was featurized.

compound_ids (list): List of compound IDs corresponding to the rows and columns of the distance matrix (assumes the matrix is square, containing distances between all compounds in a dataset).

bucket (str): Bucket where the file will be stored.

title (str): Title of the file (in human-friendly format).

description (str): Longer text describing the file (background/usage notes).

tags (list): List of tags to assign to the datastore object.

key_values (dict): Dictionary of key:value pairs to include in the datastore object's metadata.

filepath (str): Local path where the pickled DataFrame will be stored.

dataset_key (str): If updating a file already in the datastore, enter the corresponding dataset_key.
If not, leave as None and the dataset_key will be generated automatically.
Returns:
None

pipeline.compare_models module

Functions for comparing and visualizing model performance. Most of these functions rely on ATOM’s model tracker and datastore services, which are not part of the standard AMPL installation, but a few functions will work on collections of models saved as local files.

pipeline.compare_models.copy_best_filesystem_models(result_dir, dest_dir, pred_type, force_update=False)[source]

Identify the best models for each dataset within a result directory tree (e.g. from a hyperparameter search). Copy the associated model tarballs to a destination directory.

Args:

result_dir (str): Path to model training result directory.

dest_dir (str): Path of directory where model tarballs will be copied to.

pred_type (str): Prediction type (‘classification’ or ‘regression’) of models to copy

force_update (bool): If true, overwrite tarball files that already exist in dest_dir.

Returns:
pd.DataFrame: Table of performance metrics for best models.
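A usage sketch (the directory paths are placeholders):

>>> from pipeline import compare_models as cm
>>> best_df = cm.copy_best_filesystem_models('./hyperparam_results', './best_models',
...                                          pred_type='regression', force_update=True)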
pipeline.compare_models.del_ignored_params(dictionary, ignored_params)[source]

Deletes ignored parameters from the dictionary if they exist

Args:

dictionary (dict): A dictionary with parameters

ignored_params (list of str): A list of keys to delete from the dictionary, if present

Returns:
None
pipeline.compare_models.extract_collection_perf_metrics(collection_name, output_dir, pred_type='regression')[source]

Obtain list of training datasets with models in the given collection. Get performance metrics for models on each dataset and save them as CSV files in the given output directory.

Args:

collection_name (str): Name of model tracker collection to search for models.

output_dir (str): Directory where tables of performance metrics will be written.

pred_type (str): Prediction type (‘classification’ or ‘regression’) of models to query.

Returns:
None
pipeline.compare_models.extract_model_and_feature_parameters(metadata_dict)[source]

Given a model metadata dictionary, extract the model and featurizer parameters. Looks for parameter names that end in *_specific, e.g. nn_specific, auto_featurizer_specific.

Args:
metadata_dict (dict): Dictionary containing NON-FLATTENED metadata for an AMPL model
Returns:
dict: Dictionary containing featurizer and model parameters. Most such dictionaries contain the following keys: ['max_epochs', 'best_epoch', 'learning_rate', 'layer_sizes', 'drop_outs', 'rf_estimators', 'rf_max_features', 'rf_max_depth', 'xgb_gamma', 'xgb_learning_rate', 'featurizer_parameters_dict', 'model_parameters_dict']
pipeline.compare_models.get_best_models_info(col_names=None, bucket='public', pred_type='regression', result_dir=None, PK_pipeline=False, output_dir='/usr/local/data', shortlist_key=None, input_dset_keys=None, save_results=False, subset='valid', metric_type=None, selection_type='max', other_filters={})[source]

Tabulate parameters and performance metrics for the best models, according to a given metric, trained against each specified dataset.

Args:

col_names (list of str): List of model tracker collections to search.

bucket (str): Datastore bucket for training datasets.

pred_type (str): Type of models (regression or classification).

result_dir (list of str): Result directories of the models, if model tracker is not supported.

PK_pipeline (bool): Are we being called from PK pipeline?

output_dir (str): Directory to write output table to.

shortlist_key (str): Datastore key for table of datasets to query models for.

input_dset_keys (str or list of str): List of datastore keys for datasets to query models for. Either shortlist_key or input_dset_keys must be specified, but not both.

save_results (bool): If True, write the table of results to a CSV file.

subset (str): Input dataset subset (‘train’, ‘valid’, or ‘test’) for which metrics are used to select best models.

metric_type (str): Type of performance metric (r2_score, roc_auc_score, etc.) to use to select best models.

selection_type (str): Score criterion (‘max’ or ‘min’) to use to select best models.

other_filters (dict): Additional selection criteria to include in model query.

Returns:
top_models_df (DataFrame): Table of parameters and metrics for best models for each dataset.
pipeline.compare_models.get_best_perf_table(metric_type, col_name=None, result_dir=None, model_uuid=None, metadata_dict=None, PK_pipe=False)[source]

Extract parameters and training run performance metrics for a single model. The model may be specified either by a metadata dictionary, a model_uuid or a result directory; in the model_uuid case, the function queries the model tracker DB for the model metadata. For models saved in the filesystem, can query the performance data from the original result directory, but not from a saved tarball.

Args:

metric_type (str): Performance metric to include in result dictionary.

col_name (str): Collection name containing model, if model is specified by model_uuid.

result_dir (str): result directory of the model, if Model tracker is not supported and metadata_dict not provided.

model_uuid (str): UUID of model to query, if metadata_dict is not provided.

metadata_dict (dict): Full metadata dictionary for a model, including training metrics and dataset metadata.

PK_pipe (bool): If True, include some additional parameters in the result dictionary specific to PK models.

Returns:
model_info (dict): Dictionary of parameter or metric name - value pairs.
Todo:
Add support for models saved as local tarball files.
pipeline.compare_models.get_collection_datasets(collection_name)[source]

Returns a list of unique training datasets used for all models in a given collection.

Args:
collection_name (str): Name of model tracker collection to search for models.
Returns:
list: List of model training (dataset_key, bucket) tuples.
pipeline.compare_models.get_dataset_models(collection_names, filter_dict={})[source]

Query the model tracker for all models saved in the model tracker DB under the given collection names. Returns a dictionary mapping (dataset_key,bucket) pairs to the list of (collection,model_uuid) pairs trained on the corresponding datasets.

Args:

collection_names (list): List of names of model tracker collections to search for models.

filter_dict (dict): Additional filter criteria to use in model query.

Returns:
dict: Dictionary mapping training set (dataset_key, bucket) tuples to (collection, model_uuid) pairs.
pipeline.compare_models.get_filesystem_models(result_dir, pred_type)[source]

Identify all models in result_dir and create a perf_result table with a 'tarball_path' column containing the path to each model tarball.

pipeline.compare_models.get_filesystem_perf_results(result_dir, pred_type='classification')[source]

Retrieve metadata and performance metrics for models stored in the filesystem from a hyperparameter search run.

Args:

result_dir (str): Root directory for results from a hyperparameter search training run.

pred_type (str): Prediction type (‘classification’ or ‘regression’) of models to query.

Returns:
pd.DataFrame: Table of metadata fields and performance metrics.
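A usage sketch (the result directory path is a placeholder):

>>> from pipeline import compare_models as cm
>>> perf_df = cm.get_filesystem_perf_results('./hyperparam_results', pred_type='regression')
>>> perf_df.head()   # inspect the metadata and metric columns for the trained models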
pipeline.compare_models.get_multitask_perf_from_files(result_dir, pred_type='regression')[source]

Retrieve model metadata and performance metrics stored in the filesystem from a multitask hyperparameter search. Format the per-task performance metrics in a table with a row for each task and columns for each model/subset combination.

Args:

result_dir (str): Path to root result directory containing output from a hyperparameter search run.

pred_type (str): Prediction type (‘classification’ or ‘regression’) of models to query.

Returns:
pd.DataFrame: Table of model metadata fields and performance metrics.
pipeline.compare_models.get_multitask_perf_from_files_new(result_dir, pred_type='regression')[source]

Retrieve model metadata and performance metrics stored in the filesystem from a multitask hyperparameter search. Format the per-task performance metrics in a table with a row for each task and columns for each model/subset combination.

Args:

result_dir (str): Path to root result directory containing output from a hyperparameter search run.

pred_type (str): Prediction type (‘classification’ or ‘regression’) of models to query.

Returns:
pd.DataFrame: Table of model metadata fields and performance metrics.
pipeline.compare_models.get_multitask_perf_from_tracker(collection_name, response_cols=None, expand_responses=None, expand_subsets='test', exhaustive=False)[source]

Retrieve full metadata and metrics from model tracker for all models in a collection and format them into a table, including per-task performance metrics for multitask models.

Meant for multitask NN models, but works for single task models as well.

By AKP. Works for model tracker as of 10/2020

Args:

collection_name (str): Name of model tracker collection to search for models.

response_cols (list, str or None): Names of tasks (response columns) to query performance results for.
If None, checks whether all models in the collection share the same response columns, and uses those. Otherwise, this should be a list of strings or a comma-separated string. Note: make sure the response columns are listed in the same order as in the model metadata; if unsure, run with None first to see that order.
expand_responses (list, str or None): Names of tasks / response columns you want to include results for in
the final dataframe. Useful if you have a lot of tasks and only want to look at the performance of a few of them. Must also be a list or comma separated string, and must be a subset of response_cols. If None, will expand all responses.
expand_subsets (list, str or None): Dataset subsets (‘train’, ‘valid’ and/or ‘test’) to show metrics for.
Again, must be list or comma separated string, or None to expand all.
exhaustive (bool): If True, return large dataframe with all model tracker metadata minus any columns not
in expand_responses. If False, return trimmed dataframe with most relevant columns.
Returns:
pd.DataFrame: Table of model metadata fields and performance metrics.
pipeline.compare_models.get_summary_metadata_table(uuids, collections=None)[source]

Tabulate metadata fields and performance metrics for a set of models identified by specific model_uuids.

Args:

uuids (list): List of model UUIDs to query.

collections (list or str): Names of collections in model tracker DB to get models from. If collections is
a string, it must identify one collection to search for all models. If a list, it must be of the same length as uuids. If not provided, all collections will be searched.
Returns:
pd.DataFrame: Table of metadata fields and performance metrics for models.
pipeline.compare_models.get_summary_perf_tables(collection_names=None, filter_dict={}, result_dir=None, prediction_type='regression', verbose=False)[source]

Load model parameters and performance metrics from model tracker for all models saved in the model tracker DB under the given collection names (or result directory if Model tracker is not available) with the given prediction type. Tabulate the parameters and metrics including:

dataset (assay name, target, parameter, key, bucket)

dataset size (train/valid/test/total)

number of training folds

model type (NN or RF)

featurizer

transformation type

metrics: r2_score, mae_score and rms_score for regression, or ROC AUC for classification
Args:

collection_names (list): Names of model tracker collections to search for models.

filter_dict (dict): Additional filter criteria to use in model query.

result_dir (str or list): Directories to search for models; must be provided if the model tracker DB is not available.

prediction_type (str): Type of models (classification or regression) to query.

verbose (bool): If true, print status messages as collections are processed.

Returns:
pd.DataFrame: Table of model metadata fields and performance metrics.
pipeline.compare_models.get_tarball_perf_table(model_tarball, pred_type='classification')[source]

Retrieve model metadata and performance metrics for a model saved as a tarball (.tar.gz) file.

Args:

model_tarball (str): Path of model tarball file, named as model.tar.gz.

pred_type (str): Prediction type (‘classification’ or ‘regression’) of model.

Returns:
tuple (pd.DataFrame, dict): Table of performance metrics and a dictionary of model metadata.
pipeline.compare_models.get_training_datasets(collection_names)[source]

Query the model tracker DB for all the unique dataset keys and buckets used to train models in the given collections.

Args:
collection_names (list): List of names of model tracker collections to search for models.
Returns:
dict: Dictionary mapping collection names to lists of (dataset_key, bucket) tuples for training sets.
pipeline.compare_models.get_training_perf_table(dataset_key, bucket, collection_name, pred_type='regression', other_filters={})[source]

Load performance metrics from model tracker for all models saved in the model tracker DB under a given collection that were trained against a particular dataset. Identify training parameters that vary between models, and generate plots of performance vs particular combinations of parameters.

Args:

dataset_key (str): Training dataset key.

bucket (str): Training dataset bucket.

collection_name (str): Name of model tracker collection to search for models.

pred_type (str): Prediction type (‘classification’ or ‘regression’) of models to query.

other_filters (dict): Other filter criteria to use in querying models.

Returns:
pd.DataFrame: Table of models and performance metrics.
pipeline.compare_models.get_umap_nn_model_perf_table(dataset_key, bucket, collection_name, pred_type='regression')[source]

Load performance metrics from model tracker for all NN models with the given prediction_type saved in the model tracker DB under a given collection that were trained against a particular dataset. Show parameter settings for UMAP transformer for models where they are available.

Args:

dataset_key (str): Dataset key for training dataset.

bucket (str): Dataset bucket for training dataset.

collection_name (str): Name of model tracker collection to search for models.

pred_type (str): Prediction type (‘classification’ or ‘regression’) of models to query.

Returns:
pd.DataFrame: Table of model performance metrics.
pipeline.compare_models.num_trainable_parameters_from_file(tar_path)[source]

Return the number of trainable parameters from a tar file

Given a tar file for a DeepChem model, this will return the number of trainable parameters

Args:
tar_path (str): Path to a DeepChem model tarball
Returns:
int: Number of trainable parameters.
Raises:
ValueError: If the model is not a DeepChem neural network model

pipeline.dist_metrics module

Distance metrics for compounds: Tanimoto and maximum common substructure (MCS)

pipeline.dist_metrics.mcs(mols1, mols2=None)[source]

Computes maximum common substructure (MCS) distances between pairs of molecules.

The MCS distance between molecules m1 and m2 is one minus the average of fMCS(m1,m2) and fMCS(m2,m1), where fMCS(m1,m2) is the fraction of m1’s atoms that are part of the largest common substructure of m1 and m2.

Args:

mols1 (Sequence of rdkit.Mol): First list of molecules.

mols2 (Sequence of rdkit.Mol, optional): Second list of molecules.
If not provided, computes MCS distances between pairs of molecules in mols1. Otherwise, computes a matrix of distances between pairs of molecules from mols1 and mols2.
Returns:
np.ndarray: Matrix of pairwise distances between molecules.
pipeline.dist_metrics.tanimoto(fps1, fps2=None)[source]

Compute Tanimoto distances between sets of ECFP fingerprints.

Args:

fps1 (Sequence): First list of ECFP fingerprint vectors.

fps2 (Sequence, optional): Second list of ECFP fingerprint vectors.
If not provided, computes distances between pairs of fingerprints in fps1. Otherwise, computes a matrix of distances between pairs of fingerprints in fps1 and fps2.
Returns:
np.ndarray: Matrix of pairwise distances between fingerprints.
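A minimal sketch computing a distance matrix from RDKit Morgan/ECFP fingerprints (this assumes the function accepts fingerprints in RDKit bit-vector form):

>>> from rdkit import Chem
>>> from rdkit.Chem import AllChem
>>> from pipeline.dist_metrics import tanimoto
>>> mols = [Chem.MolFromSmiles(s) for s in ['CCO', 'CCN', 'c1ccccc1']]
>>> fps = [AllChem.GetMorganFingerprintAsBitVect(m, 2, nBits=1024) for m in mols]
>>> dists = tanimoto(fps)   # 3 x 3 symmetric matrix of Tanimoto distances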
pipeline.dist_metrics.tanimoto_single(fp, fps)[source]

Compute a vector of Tanimoto distances between a single fingerprint and each fingerprint in a list.

Args:

fp : Fingerprint to be compared.

fps (Sequence): List of ECFP fingerprint vectors.

Returns:
np.ndarray: Vector of distances between fp and each fingerprint in fps.

pipeline.diversity_plots module

Plotting routines for visualizing chemical diversity of datasets

pipeline.diversity_plots.diversity_plots(dset_key, datastore=True, bucket='public', title_prefix=None, ecfp_radius=4, umap_file=None, out_dir=None, id_col='compound_id', smiles_col='rdkit_smiles', is_base_smiles=False, response_col=None, max_for_mcs=300)[source]

Plot visualizations of diversity for an arbitrary table of compounds. At minimum, the file should contain columns for a compound ID and a SMILES string. Produces a clustered heatmap display of Tanimoto distances between compounds along with a 2D UMAP projection plot based on ECFP fingerprints, with points colored according to the response variable.

Args:

dset_key (str): Datastore key or filepath for dataset.

datastore (bool): Whether to load dataset from datastore or from filesystem.

bucket (str): Name of datastore bucket containing dataset.

title_prefix (str): Prefix for plot titles.

ecfp_radius (int): Radius for ECFP fingerprint calculation.

umap_file (str, optional): Path to file to write UMAP coordinates to.

out_dir (str, optional): Output directory for plots and tables. If provided, plots will be output as PDF files rather
than in the current notebook, and some additional CSV files will be generated.

id_col (str): Column in dataset containing compound IDs.

smiles_col (str): Column in dataset containing SMILES strings.

is_base_smiles (bool): True if SMILES strings do not need to be salt-stripped and standardized.

response_col (str): Column in dataset containing response values.

max_for_mcs (int): Maximum dataset size for plots based on MCS distance. If the number of compounds is less than this
value, an additional cluster heatmap and UMAP projection plot will be produced based on maximum common substructure distance.
pipeline.diversity_plots.plot_dataset_dist_distr(dataset, feat_type, dist_metric, task_name, **metric_kwargs)[source]

Generate a density plot showing the distribution of distances between dataset feature vectors, using the specified feature type and distance metric.

Args:

dataset (deepchem.Dataset): A dataset object. At minimum, it should contain a 2D numpy array ‘X’ of feature vectors.

feat_type (str): Type of features (‘ECFP’ or ‘descriptors’).

dist_metric (str): Name of metric to be used to compute distances; can be anything supported by scipy.spatial.distance.pdist.

task_name (str): Abbreviated name to describe dataset in plot title.

metric_kwargs: Additional arguments to pass to metric.

Returns:
np.ndarray: Distance matrix.
pipeline.diversity_plots.plot_tani_dist_distr(df, smiles_col, df_name, radius=2, subset_col='subset', subsets=False, ref_subset='train', plot_width=6, ndist_max=None, **metric_kwargs)[source]

Generate a density plot showing the distribution of nearest neighbor distances between ecfp feature vectors, using the Tanimoto metric. Optionally split by subset.

Args:

df (DataFrame): A data frame containing, at minimum, a column of SMILES strings.

smiles_col (str): Name of the column containing SMILES strings.

df_name (str): Name for the dataset, to be used in the plot title.

radius (int): Radius parameter used to calculate ECFP fingerprints. The default is 2, meaning that ECFP4 fingerprints are calculated.

subset_col (str): Name of the column containing subset names.

subsets (bool): If True, distances are only calculated for compounds not in the reference subset, and the distances computed are to the nearest neighbors in the reference subset.

ref_subset (str): Reference subset for nearest-neighbor distances, if subsets is True.

plot_width (float): Plot width in inches.

ndist_max (int): Not used, included only for backward compatibility.

metric_kwargs: Additional arguments to pass to metric. Not used, included only for backward compatibility.

Returns:
dist (DataFrame): Table of individual nearest-neighbor Tanimoto distance values. If subsets is True, the table will include a column indicating the subset each compound belongs to.
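A usage sketch (assuming df is a pandas DataFrame with 'rdkit_smiles' and 'subset' columns, as produced by AMPL dataset splitting):

>>> from pipeline.diversity_plots import plot_tani_dist_distr
>>> dist_df = plot_tani_dist_distr(df, smiles_col='rdkit_smiles', df_name='My dataset',
...                                subsets=True, ref_subset='train')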

pipeline.featurization module

Classes providing different methods of featurizing compounds and other data entities

class pipeline.featurization.ComputedDescriptorFeaturization(params)[source]

Bases: pipeline.featurization.DescriptorFeaturization

Subclass for featurizers that support online computation of descriptors, usually given SMILES strings or RDKit Mol objects as input rather than compound IDs. The computed descriptors may be cached and combined with tables of precomputed descriptors to speed up access. Featurized datasets may be persisted to the filesystem or datastore.

Attributes:

Set in __init__:

feat_type (str): Type of featurizer, set in the superclass __init__

descriptor_type (str): The type of descriptor

descriptor_key (str): The path to the descriptor featurization matrix if it is saved to a file, or the key to the file in the datastore

descriptor_base (str/path): The base path to the descriptor featurization matrix

precomp_descr_table (pd.DataFrame): Initialized as an empty DataFrame; will be overwritten with the full descriptor table

compute_descriptors(smiles_df, params)[source]

Compute descriptors for the SMILES strings given in smiles_df.

Args:

smiles_df: DataFrame containing SMILES strings to compute descriptors for.

params (Namespace): Argparse Namespace argument containing the parameters

Returns:
ret_df (DataFrame): Data frame containing the compound IDs, SMILES string and descriptor columns as specified in the current parameters.
compute_moe_descriptors(smiles_df, params)[source]

Compute MOE descriptors for the SMILES strings in the given data frame

Args:

smiles_df (DataFrame): Table containing SMILES strings to compute descriptors for.

params (Namespace): Argparse Namespace argument containing the parameters.

Returns:

(tuple): Tuple containing:

desc_df (DataFrame): Data frame containing computed descriptors

is_valid (ndarray of bool): True for each input SMILES string that was valid according to RDKit

compute_mordred_descriptors(smiles_strs, params)[source]

Compute Mordred descriptors for the given list of SMILES strings

Args:

smiles_strs (iterable): SMILES strings to compute descriptors for.

params (Namespace): Argparse Namespace argument containing the parameters.

Returns:

(tuple): Tuple containing:

desc_df (DataFrame): Data frame containing computed descriptors

is_valid (ndarray of bool): True for each input SMILES string that was valid according to RDKit

compute_rdkit_descriptors(smiles_df, smiles_col='rdkit_smiles')[source]

Compute RDKit descriptors for the SMILES strings in the given data frame

Args:
smiles_df (DataFrame): Table containing SMILES strings to compute descriptors for.
smiles_col (str): Name of the column in smiles_df containing SMILES strings.
Returns:

(tuple): Tuple containing:

desc_df (DataFrame): Data frame containing computed descriptors

is_valid (ndarray of bool): True for each input SMILES string that was valid according to RDKit

featurize_data(dset_df, model_dataset)[source]

Perform featurization on the given dataset, by computing descriptors from SMILES strings or matching them to SMILES in precomputed table.

Args:

dset_df (DataFrame): A table of data to be featurized. At minimum, should include columns for the compound ID and assay value; for some featurizers, must also contain a SMILES string column.

model_dataset (ModelDataset): Object containing the dataset to be featurized

Returns:

Tuple of (features, ids, vals, attr).

features (np.array): Feature matrix.

ids (pd.DataFrame): compound IDs or SMILES strings if needed for splitting.

vals (np.array): array of response values.

attr (pd.DataFrame): dataframe containing SMILES strings indexed by compound IDs.

Raises:
Exception: if features is None, featurization failed for the dataset
Side effects:
Loads a precomputed descriptor table and sets self.precomp_descr_table to point to it, if one is specified by params.descriptor_key.
get_featurized_dset_name(dataset_name)[source]

Returns a name for the featurized dataset, for use in filenames or dataset keys. Does not include any path information. May be derived from dataset_name.

Args:
dataset_name (str): Name of the dataset
Returns:
(str): A name for the featurized dataset
scale_moe_descriptors(desc_df, descr_type)[source]

Scale selected descriptors computed by MOE by dividing their values by the atom count per molecule.

Args:

desc_df (DataFrame): Data frame containing computed descriptors.

descr_type (str): Descriptor type, used to look up expected set of descriptor columns.

Returns:
scaled_df (DataFrame): Data frame with scaled descriptors.
class pipeline.featurization.DescriptorFeaturization(params)[source]

Bases: pipeline.featurization.PersistentFeaturization

Subclass for featurizers that map sets of (usually) precomputed descriptors to compound IDs; the resulting merged dataset is persisted to the filesystem or datastore.

Attributes:

Set in __init__:

feat_type (str): Type of featurizer, set in the superclass __init__

descriptor_type (str): The type of descriptor

descriptor_key (str): The path to the descriptor featurization matrix if it is saved to a file, or the key to the file in the datastore

descriptor_base (str/path): The base path to the descriptor featurization matrix

precomp_descr_table (pd.DataFrame): Initialized as an empty DataFrame; will be overwritten with the full descriptor table

Class attributes:

supported_descriptor_types

all_desc_col

create_feature_transformer(dataset)[source]

Fit a scaling and centering transformation to the feature matrix of the given dataset, and return a DeepChem transformer object holding its parameters.

Args:
dataset (deepchem.Dataset): featurized dataset
Returns:
(list of DeepChem transformer objects): list of transformers for the feature matrix
desc_type_cols = {}
desc_type_scaled = {}
desc_type_source = {}
extract_prefeaturized_data(merged_dset_df, model_dataset)[source]

Attempts to retrieve prefeaturized data for the given dataset.

Args:

merged_dset_df (pd.DataFrame): dataset merged with the featurizers

model_dataset (ModelDataset): Object containing the dataset to be featurized # TODO: Remove model_dataset call once params.response_cols is properly set

Returns:

Tuple of (features, ids, vals, attr).

features (np.array): Feature matrix.

ids (pd.DataFrame): compound IDs or SMILES strings if needed for splitting.

vals (np.array): array of response values.

attr (pd.DataFrame): dataframe containing SMILES strings indexed by compound IDs.

featurize_data(dset_df, model_dataset)[source]

Perform featurization on the given dataset.

Args:

dset_df (DataFrame): A table of data to be featurized. At minimum, should include columns for the compound ID and assay value; for some featurizers, must also contain a SMILES string column.

model_dataset (ModelDataset): Object containing the dataset to be featurized # TODO: Remove model_dataset call once params.response_cols is properly set

Returns:

Tuple of (features, ids, vals, attr).

features (np.array): Feature matrix.

ids (pd.DataFrame): compound IDs or SMILES strings if needed for splitting.

vals (np.array): array of response values.

attr (pd.DataFrame): dataframe containing SMILES strings indexed by compound IDs.

Raises:
Exception: if features is None, featurization failed for the dataset
Side effects:
Overwrites the attribute precomp_descr_table (pd.DataFrame) with the appropriate descriptor table
get_feature_columns()[source]

Returns a list of feature column names associated with this Featurization instance.

Args:
None
Returns:
(list): List of column names of the features, pulled from DescriptorFeaturization attributes
get_feature_count()[source]

Returns the number of feature columns associated with this Featurization instance.

Args:
None
Returns:
(int): Number of feature columns associated with DescriptorFeaturization
get_feature_specific_metadata(params)[source]

Returns a dictionary of parameter settings for this Featurization object that are specific to the feature type.

Args:
params (Namespace): Argparse Namespace argument containing the parameters
get_featurized_data_subdir()[source]

Returns the name of a subdirectory (without any leading path information) in which to store featurized data, if it goes to the filesystem.

Returns:
(str): ‘scaled_descriptors’
get_featurized_dset_name(dataset_name)[source]

Returns a name for the featurized dataset, for use in filenames or dataset keys. Does not include any path information. May be derived from dataset_name.

Args:
dataset_name (str): Name of the dataset
Returns:
(str): A name for the featurized dataset
classmethod load_descriptor_spec(desc_spec_bucket, desc_spec_key)[source]

Read a descriptor specification table from the datastore or the filesystem. The table is a CSV file with the following columns:

descr_type: A string specifying a descriptor source/program and a subset of descriptor columns

source: Name of the program/package that generates the descriptors

scaled: Binary indicator for whether a subset of the descriptor values are scaled by the molecule's atom count

descriptors: A semicolon-separated list of descriptor columns.

The values in the table are used to set class variables desc_type_cols, desc_type_source and desc_type_scaled.

Args:

desc_spec_bucket : bucket where descriptor spec is located

desc_spec_key: data store key, or full file path to locate descriptor spec object

Returns:
None
Side effects:

Sets the following class variables:

cls.desc_type_cols -> map from descriptor types to their associated descriptor column names

cls.desc_type_source -> map from descriptor types to the program/package that generates them

cls.desc_type_scaled -> map from descriptor types to boolean indicators of whether some descriptor values are scaled.

cls.supported_descriptor_types -> the list of available descriptor types

load_descriptor_table(params)[source]

Load the table of precomputed feature values for the descriptor type specified in params, from the datastore_key or path specified by params.descriptor_key and params.descriptor_bucket. Will try to load the table from the local filesystem if possible, but the table should at least have a metadata record in the datastore. The local file path is the same as descriptor_key on twintron-blue, and may be taken from the LC_path metadata property if it is set.

Args:
params (Namespace): Parameters for the current pipeline instance.
Returns:
None
Side effects:
Overwrites the attribute precomp_descr_table (pd.DataFrame) with the loaded descriptor table. Sets attributes desc_id_col and desc_smiles_col, if possible, based on the datastore metadata for the descriptor table. Otherwise, sets them to reasonable defaults. Note that not all descriptor tables contain SMILES strings, but one is required if the table is to be used with ComputedDescriptorFeaturization.
supported_descriptor_types = []
class pipeline.featurization.DynamicFeaturization(params)[source]

Bases: pipeline.featurization.Featurization

Featurization subclass that supports on-the-fly featurization. Can be used when it is inexpensive to compute the features. Most DeepChem featurizers are handled through this class.

Attributes:

Set in __init__:

feat_type (str): Type of featurizer, one of ['ecfp', 'graphconv', 'molvae']

featurization_obj: The DeepChem or MoleculeVAEFeaturizer object, as determined by feat_type and params
create_feature_transformer(dataset)[source]

Fit a scaling and centering transformation to the feature matrix of the given dataset, and return a DeepChem transformer object holding its parameters.

Args:
dataset (deepchem.Dataset): featurized dataset
Returns:
Empty list since we will not be transforming the features of a DynamicFeaturization object
extract_prefeaturized_data(merged_dset_df, model_dataset)[source]

Attempts to extract prefeaturized data for the given dataset. For dynamic featurizers, we don’t save this data, so this method always returns None.

Args:

merged_dset_df (DataFrame): dataset merged with the featurizers

model_dataset (ModelDataset): Object containing the dataset to be featurized

Returns:
None, None, None, None
featurize(mols)[source]

Calls the featurize() method of the underlying DeepChem featurizer object

featurize_data(dset_df, model_dataset)[source]

Perform featurization on the given dataset.

Args:

dset_df (DataFrame): A table of data to be featurized. At minimum, should include columns for the compound ID and assay value; for some featurizers, must also contain a SMILES string column. # TODO: remove model_dataset after ensuring response_cols are set correctly.

model_dataset (ModelDataset): Contains the dataset to be featurized

Returns:

Tuple of (features, ids, vals, attr).

features (np.array): Feature matrix.

ids (pd.DataFrame): compound IDs or SMILES strings if needed for splitting.

vals (np.array): array of response values.

attr (pd.DataFrame): dataframe containing SMILES strings indexed by compound IDs.

get_feature_columns()[source]

Returns a list of feature column names associated with this Featurization instance. For DynamicFeaturization, the column names are essentially meaningless, so these will be “c0, c1, … etc.”.

Args:
None
Returns:
(list): List of column names in the format ['c0', 'c1', …], with one entry per feature
get_feature_count()[source]

Returns the number of feature columns associated with this Featurization instance.

Args:
None
Returns:
(int): The number of feature columns for the DynamicFeaturization subclass, feat_type specific
get_feature_specific_metadata(params)[source]

Returns a dictionary of parameter settings for this Featurization object that are specific to the feature type.

Args:
params (Namespace): Argparse Namespace object containing the parameter list
Returns:
dict: Dictionary containing featurizer-specific metadata as a subdict under the keys ['ecfp_specific', 'autoencoder_specific']
get_featurized_data_subdir()[source]

Returns the name of a subdirectory (without any leading path information) in which to store featurized data, if it goes to the filesystem.

Raises:
Exception: This method is not supported by the DynamicFeaturization subclass
get_featurized_dset_name(dataset_name)[source]

Returns a name for the featurized dataset, for use in filenames or dataset keys. Does not include any path information. May be derived from dataset_name.

Args:
dataset_name (str): Name of the dataset
Raises:
Exception: This method is not supported by the DynamicFeaturization subclass
class pipeline.featurization.Featurization(params)[source]

Bases: object

Abstract base class for featurization code

Attributes:
feat_type (str): Type of featurizer, set in __init__
create_feature_transformer(dataset)[source]

Fit a scaling and centering transformation to the feature matrix of the given dataset, and return a DeepChem transformer object holding its parameters.

Args:
dataset (deepchem.Dataset): featurized dataset
Returns:
Empty list
Raises:
NotImplementedError: Must be implemented by concrete subclasses
extract_prefeaturized_data(merged_dset_df, model_dataset)[source]

Extracts dataset features, values, IDs and attributes from the given prefeaturized data frame.

Args:

merged_dset_df (DataFrame): Data frame for the dataset.

model_dataset (ModelDataset): Backpointer to the ModelDataset object for the dataset to be featurized.

Raises:
NotImplementedError: Must be implemented by concrete subclasses
featurize_data(dset_df, model_dataset)[source]

Perform featurization on the given dataset.

Args:

dset_df (DataFrame): A table of data to be featurized. At minimum, should include columns for the compound ID and assay value; for some featurizers, must also contain a SMILES string column.

model_dataset (ModelDataset): Dataset to be featurized.

Raises:
NotImplementedError: Must be implemented by concrete subclasses
get_feature_columns()[source]

Returns a list of feature column names associated with this Featurization instance.

Args:
None
Raises:
NotImplementedError: Must be implemented by concrete subclasses
get_feature_count()[source]

Returns the number of feature columns associated with this Featurization instance.

Args:
None
Raises:
NotImplementedError: Must be implemented by concrete subclasses
get_feature_specific_metadata(params)[source]

Returns a dictionary of parameter settings for this Featurization object that are specific to the feature type.

Args:
params (Namespace): Contains parameters used to instantiate the featurizer.
get_featurized_data_subdir()[source]

Returns the name of a subdirectory (without any leading path information) in which to store featurized data, if it goes to the filesystem.

Raises:
NotImplementedError: Must be implemented by concrete subclasses
get_featurized_dset_name(dataset_name)[source]

Returns a name for the featurized dataset, for use in filenames or dataset keys. Does not include any path information. May be derived from dataset_name.

Args:
dataset_name (str): Name of the dataset
Raises:
NotImplementedError: Must be implemented by concrete subclasses
class pipeline.featurization.PersistentFeaturization(params)[source]

Bases: pipeline.featurization.Featurization

Subclass for featurizers that support persistent storage of featurized data. Used when computing or mapping the features is CPU- or memory-intensive, e.g. descriptors. Currently DescriptorFeaturization is the only subclass, but others are planned (e.g., UMAPDescriptorFeaturization).

create_feature_transformer(dataset)[source]

Fit a scaling and centering transformation to the feature matrix of the given dataset, and return a DeepChem transformer object holding its parameters.

Args:
dataset (deepchem.Dataset): featurized dataset
extract_prefeaturized_data(merged_dset_df, model_dataset)[source]

Attempts to extract prefeaturized data for the given dataset.

Args:

merged_dset_df (DataFrame): dataset merged with the featurizers

model_dataset (ModelDataset): Object containing the dataset to be featurized

Raises:
NotImplementedError: Currently only DescriptorFeaturization is supported; this is not a generic method
featurize_data(dset_df, model_dataset)[source]

Perform featurization on the given dataset.

Args:

dset_df (DataFrame): A table of data to be featurized. At minimum, should include columns for the compound ID and assay value; for some featurizers, must also contain a SMILES string column.

model_dataset (ModelDataset): Object containing the dataset to be featurized

Returns:
Tuple of (features, ids, vals, attr).

features (np.array): Feature matrix.

ids (pd.DataFrame): compound IDs or SMILES strings if needed for splitting.

vals (np.array): array of response values.

attr (pd.DataFrame): dataframe containing SMILES strings indexed by compound IDs.

Raises:
NotImplementedError: Currently only DescriptorFeaturization is supported; this is not a generic method
pipeline.featurization.compute_2d_mordred_descrs(mols)[source]

Compute 2D Mordred descriptors only

Args:
mols: List of RDKit mol objects for molecules to compute descriptors for.
Returns:
res_df: DataFrame containing Mordred descriptors for molecules.
pipeline.featurization.compute_all_moe_descriptors(smiles_df, params)[source]

Run MOE to compute all 317 standard descriptors.

Args:

smiles_df (DataFrame): Table containing SMILES strings and compound IDs

params (Namespace): Parsed model parameters, used to identify SMILES and compund ID columns.

Returns:
descr_df (DataFrame): Table containing the input SMILES strings and compound IDs, the “washed” SMILES string prepared by MOE, a sequence index, and columns for each MOE descriptor.
pipeline.featurization.compute_all_mordred_descrs(mols, max_cpus=None, quiet=True)[source]

Compute all Mordred descriptors, including 3D ones

Args:

mols: List of RDKit mol objects for molecules to compute descriptors for.

max_cpus: Max number of cores to use for computing descriptors. None means use all available cores.

quiet: If True, avoid displaying progress indicators for computations.

Returns:
res_df: DataFrame containing Mordred descriptors for molecules.
pipeline.featurization.compute_all_rdkit_descrs(mol_df, mol_col='mol')[source]

Compute all RDKit descriptors

Args:
mol_df (DataFrame): Table containing RDKit Mol objects in the column given by mol_col.
mol_col (str): Name of the column containing the Mol objects.
Returns:
res_df (DataFrame): Data frame containing computed descriptors.
pipeline.featurization.compute_mordred_descriptors_from_smiles(smiles_strs, max_cpus=None, quiet=True, smiles_col='rdkit_smiles')[source]

Compute 2D and 3D Mordred descriptors for the given list of SMILES strings.

Args:

smiles_strs: A list or array of SMILES strings

max_cpus: The maximum number of cores to use for computing descriptors. The default value None means
that all available cores will be used.

quiet (bool): If True, suppress displaying a progress indicator while computing descriptors.

smiles_col (str): The name of the column that will contain SMILES strings in the returned data frame.

Returns: tuple
desc_df (DataFrame): A table of Mordred descriptors for the input SMILES strings that were valid (according to RDKit), together with those SMILES strings.
is_valid (ndarray of bool): An array of the same length as smiles_strs, indicating which SMILES strings were considered valid.
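A minimal sketch (the invalid string is included to show how is_valid flags it):

>>> from pipeline.featurization import compute_mordred_descriptors_from_smiles
>>> smiles = ['CCO', 'not_a_smiles', 'c1ccccc1O']
>>> desc_df, is_valid = compute_mordred_descriptors_from_smiles(smiles, max_cpus=1)
>>> is_valid   # False for the SMILES string RDKit could not parse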
pipeline.featurization.compute_rdkit_descriptors_from_smiles(smiles_strs, smiles_col='rdkit_smiles')[source]

Compute 2D and 3D RDKit descriptors for the given list of SMILES strings.

Args:

smiles_strs: A list or array of SMILES strings

smiles_col (str): The name of the column that will contain SMILES strings in the returned data frame.

Returns: tuple
desc_df (DataFrame): A table of RDKit descriptors for the input SMILES strings that were valid (according to RDKit), together with those SMILES strings.
is_valid (ndarray of bool): An array of the same length as smiles_strs, indicating which SMILES strings were considered valid.
pipeline.featurization.create_featurization(params)[source]

Factory method to create the appropriate type of Featurization object for params.featurizer

Args:
params (argparse.Namespace): Object containing the parameter list
Returns:
Featurization object of the correct subclass as specified by params.featurizer
Raises:
ValueError: If params.featurizer not in [‘ecfp’,’graphconv’,’molvae’,’computed_descriptors’,’descriptors’]
pipeline.featurization.featurize_smiles(df, featurizer, smiles_col, log_every_N=1000)[source]

Replacement for DeepChem 2.1 featurize_smiles_df function, which is buggy. Computes features using featurizer for dataframe df column given by smiles_col. Returns them as a numpy array, along with an array ‘is_valid’ indicating which rows of the input dataframe yielded valid features.

pipeline.featurization.get_2d_mols(smiles_strs)[source]

Convert SMILES strings to RDKit Mol objects without explicit hydrogens or 3D coordinates

Args:
smiles_strs (iterable of str): List of SMILES strings to convert
Returns:
tuple (mols, is_valid):
mols (ndarray of Mol): Mol objects for the valid SMILES strings only
is_valid (ndarray of bool): True for each input SMILES string that was valid according to RDKit
pipeline.featurization.get_3d_mols(smiles_strs)[source]

Convert SMILES strings to Mol objects with explicit hydrogens and 3D coordinates

Args:
smiles_strs (iterable of str): List of SMILES strings to convert
Returns:
tuple (mols, is_valid):
mols (ndarray of Mol): Mol objects for the valid SMILES strings only
is_valid (ndarray of bool): True for each input SMILES string that was valid according to RDKit
pipeline.featurization.get_dataset_attributes(dset_df, params)[source]

Construct a table mapping compound IDs to SMILES strings and possibly other attributes (e.g., dates) specified in params.

Args:

dset_df (DataFrame): The dataset table

params (Namespace): Parsed parameters. The id_col and smiles_col parameters are used to specify the columns in dset_df containing compound IDs and SMILES strings, respectively. If the parameter date_col is not None, it is used to specify a column of datetime strings associated with each compound.

Returns:
attr_df (DataFrame): A table of SMILES strings and (optionally) other attributes, indexed by compound_id.
pipeline.featurization.get_mordred_calculator(exclude=['EState', 'MolecularDistanceEdge'], ignore_3D=False)[source]

Create a Mordred calculator with all descriptor modules registered except those whose names are in the exclude list. Register ATOM versions of the classes in those modules instead.

Args:

exclude (list): List of Mordred descriptor modules to exclude.

ignore_3D (bool): Whether to exclude descriptors that require computing 3D structures.

Returns:
calc (mordred.Calculator): Object for performing Mordred descriptor calculations.
pipeline.featurization.get_rdkit_calculator(desc_list)[source]

Create a Mordred calculator with only the RDKit wrapper descriptor modules registered

pipeline.featurization.get_user_specified_features(df, featurizer, verbose=False)[source]

Temp fix for DC 2.3 issue. See https://github.com/deepchem/deepchem/issues/1841

pipeline.featurization.make_weights(vals)[source]

In the case of multitask learning, we must create a weight for each sample, labeling it 0 if the response value is missing and 1 if it is present.

Args:
vals: Numpy array containing NaNs where response values are missing.
Returns:
vals: Numpy array, the same as the input vals except that NaNs are replaced with 0.
w: Numpy array of the same shape as vals, where w[i,j] = 0 if vals[i,j] is NaN and w[i,j] = 1 otherwise.
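A small sketch of the expected behavior:

>>> import numpy as np
>>> from pipeline.featurization import make_weights
>>> vals = np.array([[1.0, np.nan], [np.nan, 0.0]])
>>> vals_out, w = make_weights(vals)
>>> # NaNs in vals_out become 0; w is 0 where the response was missing and 1 where it was present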
pipeline.featurization.remove_duplicate_smiles(dset_df, smiles_col='rdkit_smiles')[source]

Remove any rows with duplicate SMILES strings from the given dataset.

Args:

dset_df (DataFrame): The dataset table.

smiles_col (str): The column containing SMILES strings.

Returns:
filtered_dset_df (DataFrame): The dataset filtered to remove duplicate SMILES strings.
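A small sketch (the column names follow the defaults used elsewhere in the pipeline):

>>> import pandas as pd
>>> from pipeline.featurization import remove_duplicate_smiles
>>> dset_df = pd.DataFrame({'compound_id': ['c1', 'c2', 'c3'],
...                         'rdkit_smiles': ['CCO', 'CCO', 'CCN']})
>>> filtered_df = remove_duplicate_smiles(dset_df)   # rows sharing the duplicated 'CCO' are removed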

pipeline.feature_importance module

Functions to assess feature importance in AMPL models

pipeline.feature_importance.base_feature_importance(model_pipeline=None, params=None)[source]

Minimal baseline feature importance function. Given an AMPL model (or the parameters to train a model), returns a data frame with a row for each feature. The columns of the data frame depend on the model type and prediction type. If the model is a binary classifier, the columns include t-statistics and p-values for the differences between the means of the active and inactive compounds. If the model is a random forest, the columns will include the mean decrease in impurity (MDI) of each feature, taken from the scikit-learn feature_importances_ attribute. See the scikit-learn documentation for warnings about interpreting the MDI importance. For all models, the returned data frame will include feature names, means and standard deviations for each feature.

This function has been tested on RFs and NNs with rdkit descriptors. Other models and feature combinations may not be supported.

Args:

model_pipeline (ModelPipeline): A pipeline object for a model that was trained in the current Python session or loaded from the model tracker or a tarball file. Either model_pipeline or params must be provided.

params (dict): Parameter dictionary for a model to be trained and analyzed. Either model_pipeline or a params argument must be passed; if both are passed, params is ignored and the parameters from model_pipeline are used.

Returns:
(imp_df, model_pipeline, pparams) (tuple):

imp_df (DataFrame): Table of feature importance metrics.

model_pipeline (ModelPipeline): Pipeline object for the model that was passed to or trained by the function.

pparams (Namespace): Parsed parameters of the model.
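A hypothetical call pattern for a model already loaded in the current session; my_pipeline is an illustrative ModelPipeline variable:

    from pipeline import feature_importance as fi

    imp_df, mp, pparams = fi.base_feature_importance(model_pipeline=my_pipeline)
    print(imp_df.head())  # feature names, means, std devs, plus model-specific columns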
pipeline.feature_importance.cluster_permutation_importance(model_pipeline=None, params=None, score_type=None, clust_height=1, result_file=None, nreps=10, nworkers=1)[source]

Divide the input features used in a model into correlated clusters, then assess the importance of the features by iterating over clusters, permuting the values of all the features in the cluster, and measuring the effect on the model performance metric given by score_type for the training, validation and test subsets.

Args:

model_pipeline (ModelPipeline): A pipeline object for a model that was trained in the current Python session or loaded from the model tracker or a tarball file. Either model_pipeline or params must be provided.

params (dict): Parameter dictionary for a model to be trained and analyzed. Either model_pipeline or a params argument must be passed; if both are passed, params is ignored and the parameters from model_pipeline are used.

clust_height (float): Height at which to cut the dendrogram branches to split features into clusters.

result_file (str): Path to a CSV file where a table of features and cluster indices will be written.

nreps (int): Number of repetitions of the permutation and rescoring procedure to perform for each feature; the importance values returned will be averages over repetitions. More repetitions will yield better importance estimates at the cost of greater computing time.

nworkers (int): Number of parallel worker threads to use for permutation and rescoring. Currently ignored; multithreading will be added in a future version.

Returns:
imp_df (DataFrame): Table of feature clusters and importance values
pipeline.feature_importance.display_feature_clusters(model_pipeline=None, params=None, clust_height=1, corr_file=None, show_matrix=False, show_dendro=True)[source]

Cluster the input features used in the model specified by model_pipeline or params, using Spearman correlation as a similarity metric. Display a dendrogram and/or a correlation matrix heatmap, so the user can decide the height at which to cut the dendrogram in order to split the features into clusters, for input to cluster_permutation_importance.

Args:

model_pipeline (ModelPipeline): A pipeline object for a model that was trained in the current Python session or loaded from the model tracker or a tarball file. Either model_pipeline or params must be provided.

params (dict): Parameter dictionary for a model to be trained and analyzed. Either model_pipeline or a params argument must be passed; if both are passed, params is ignored and the parameters from model_pipeline are used.

clust_height (float): Height at which to draw a cut line in the dendrogram, to show how many clusters will be generated.

corr_file (str): Path to an optional CSV file to be created containing the feature correlation matrix.

show_matrix (bool): If True, plot a correlation matrix heatmap.

show_dendro (bool): If True, plot the dendrogram.

Returns:
corr_linkage (np.ndarray): Linkage matrix from correlation clustering
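A sketch of the intended two-step workflow: inspect the dendrogram first, then rerun the clustering at a chosen cut height. The variable names and score_type value are illustrative assumptions:

    from pipeline import feature_importance as fi

    # Step 1: plot the dendrogram to choose a cut height
    corr_linkage = fi.display_feature_clusters(model_pipeline=my_pipeline, clust_height=1)

    # Step 2: assess importance of the clusters produced at that height
    imp_df = fi.cluster_permutation_importance(model_pipeline=my_pipeline,
                                               score_type='r2', clust_height=1,
                                               nreps=10, result_file='clusters.csv')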
pipeline.feature_importance.permutation_feature_importance(model_pipeline=None, params=None, score_type=None, nreps=60, nworkers=1, result_file=None)[source]

Assess the importance of each feature used by a trained model by permuting the values of each feature in succession in the training, validation and test sets, making predictions, computing performance metrics, and measuring the effect of scrambling each feature on a particular metric.

Args:

model_pipeline (ModelPipeline): A pipeline object for a model that was trained in the current Python session or loaded from the model tracker or a tarball file. Either model_pipeline or params must be provided.

params (dict): Parameter dictionary for a model to be trained and analyzed. Either model_pipeline or a params argument must be passed; if both are passed, params is ignored and the parameters from model_pipeline are used.

score_type (str): Name of the scoring metric to use to assess importance. This can be any of the standard values supported by sklearn.metrics.get_scorer; the AMPL-specific values ‘npv’, ‘mcc’, ‘kappa’, ‘mae’, ‘rmse’, ‘ppv’, ‘cross_entropy’, ‘bal_accuracy’ and ‘avg_precision’ are also supported. Score types for which smaller values are better, such as ‘mae’, ‘rmse’ and ‘cross_entropy’ are mapped to their negative counterparts.

nreps (int): Number of repetitions of the permutation and rescoring procedure to perform for each feature; the importance values returned will be averages over repetitions. More repetitions will yield better importance estimates at the cost of greater computing time.

nworkers (int): Number of parallel worker threads to use for permutation and rescoring.

result_file (str): Optional path to a CSV file to which the importance table will be written.

Returns:
imp_df (DataFrame): Table of features and importance metrics. The table will include the columns returned by base_feature_importance, along with the permutation importance scores for each feature for the training, validation and test subsets.
pipeline.feature_importance.plot_feature_importances(imp_df, importance_col='valid_perm_importance_mean', max_feat=20, ascending=False)[source]

Display a horizontal bar plot showing the relative importances of the most important features or feature clusters, according to the results of permutation_feature_importance, cluster_permutation_importance or a similar function.

Args:

imp_df (DataFrame): Table of results from permutation_feature_importance, cluster_permutation_importance, base_feature_importance or a similar function.

importance_col (str): Name of the column in imp_df to plot values from.

max_feat (int): The maximum number of features or feature clusters to plot values for.

ascending (bool): Should the features be ordered by ascending values of importance_col? Defaults to False; can be set True for p-values or something else where small values mean greater importance.

Returns:
None
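An illustrative end-to-end call for a trained classification model; the score_type value and the my_pipeline variable are assumptions, while importance_col follows the documented default:

    from pipeline import feature_importance as fi

    imp_df = fi.permutation_feature_importance(model_pipeline=my_pipeline,
                                               score_type='roc_auc', nreps=60,
                                               result_file='perm_importance.csv')
    fi.plot_feature_importances(imp_df, importance_col='valid_perm_importance_mean',
                                max_feat=20)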

pipeline.model_datasets module

Classes for dealing with datasets for data-driven modeling.

class pipeline.model_datasets.DatastoreDataset(params, featurization=None, ds_client=None)[source]

Bases: pipeline.model_datasets.ModelDataset

Subclass representing a dataset for data-driven modeling that lives in the datastore.

Attributes:

set in __init__:

params (Namespace object): contains all parameter information

log (logger object): logger for all warning messages.

dataset_name (str): set from the parameter object, the name of the dataset

output_dir (str): The root directory for saving output files

split_strategy (str): the flag for determining the split strategy (e.g. ‘train_test_valid’,’k-fold’)

featurization (Featurization object): The featurization object created by ModelDataset or input as an optional argument in the factory function.

splitting (Splitting object): A splitting object created by the ModelDataset initialization method

combined_train_valid_data (dc.DiskDataset): A dataset object (initialized as None) of the merged train and valid splits

ds_client (datastore client):

set in get_featurized_data:

dataset: A new featurized DeepChem DiskDataset.

n_features: The count of features (int)

vals: The response col after featurization (np.array)

attr: A pd.DataFrame containing the compound IDs and SMILES strings

set in get_dataset_tasks:
tasks (list): list of prediction task columns
set in split_dataset or load_presplit_dataset:

train_valid_dsets: A list of tuples of (training,validation) DeepChem Datasets

test_dset: (dc.data.Dataset): The test dataset to be held out

train_valid_attr: A list of tuples of (training,validation) attribute DataFrames

test_attr: The attribute DataFrame for the test set, containing compound IDs and SMILES strings.

set in load_full_dataset()
dataset_key (str): The datastore key pointing to the dataset
get_dataset_tasks(dset_df)[source]

Sets self.tasks to the list of prediction task columns defined for this dataset. If the dataset is in the datastore, these should be available in the metadata. Otherwise we guess by looking at the column names in dset_df and excluding features, compound IDs, SMILES string columns, etc.

Args:
dset_df (pd.DataFrame): Dataset containing the prediction tasks
Returns:
Success (bool): Returns true if task names are retrieved.
Side effects:
Sets the task attribute of the DatastoreDataset object to a list of task names.
load_dataset_split_table(directory=None)[source]

Loads from the datastore a table of compound IDs assigned to each split subset of a dataset. Called by load_presplit_dataset().

Args:
directory: Ignored; included only for compatibility with the FileDataset version of this method.
Returns:
tuple(split_df, split_kv):

split_df (DataFrame): Table assigning compound IDs to split subsets and folds.

split_kv (dict): Dictionary of key-value pairs from the split table metadata; includes all the parameters that were used to define the split.
load_featurized_data()[source]

Loads prefeaturized data from the datastore. Returns a data frame, which is then passed to featurization.extract_prefeaturized_data() for processing.

Returns:
featurized_dset_df (pd.DataFrame): dataframe of the prefeaturized data; needs further processing
load_full_dataset()[source]

Loads the dataset from the datastore

Returns:
dset_df: Dataset as a DataFrame
Raises:
Exception if dset_df is None or empty due to an error in loading the dataset.
save_featurized_data(featurized_dset_df)[source]

Save a featurized dataset to the datastore

Args:
featurized_dset_df: DataFrame containing the featurized dataset.
Returns:
None
save_split_dataset(directory=None)[source]

Saves a table of compound IDs assigned to each split subset of a dataset.

Args:
directory: Ignored; included only for compatibility with the FileDataset version of this method.
class pipeline.model_datasets.FileDataset(params, featurization)[source]

Bases: pipeline.model_datasets.ModelDataset

Subclass representing a dataset for data-driven modeling that lives in the filesystem.

Attributes:

set in __init__:

params (Namespace object): contains all parameter information

log (logger object): logger for all warning messages.

dataset_name (str): set from the parameter object, the name of the dataset

output_dir (str): The root directory for saving output files

split_strategy (str): the flag for determining the split strategy (e.g. ‘train_test_valid’,’k-fold’)

featurization (Featurization object): The featurization object created by ModelDataset or input as an optional argument in the factory function.

splitting (Splitting object): A splitting object created by the ModelDataset initialization method

combined_train_valid_data (dc.DiskDataset): A dataset object (initialized as None), of the merged train and valid splits

ds_client (datastore client):

set in get_featurized_data:

dataset: A new featurized DeepChem DiskDataset.

n_features: The count of features (int)

vals: The response col after featurization (np.array)

attr: A pd.DataFrame containing the compound IDs and SMILES strings

set in get_dataset_tasks:
tasks (list): list of prediction task columns
set in split_dataset or load_presplit_dataset:

train_valid_dsets: A list of tuples of (training,validation) DeepChem Datasets

test_dset: (dc.data.Dataset): The test dataset to be held out

train_valid_attr: A list of tuples of (training,validation) attribute DataFrames

test_attr: The attribute DataFrame for the test set, containing compound IDs and SMILES strings.

get_dataset_tasks(dset_df)[source]

Returns the list of prediction task columns defined for this dataset. If the dataset is in the datastore, these should be available in the metadata. Otherwise we guess by looking at the column names in dset_df and excluding features, compound IDs, SMILES string columns, etc.

Args:
dset_df (pd.DataFrame): Dataset as a DataFrame that contains columns for the prediction tasks
Returns:
success (boolean): True if self.tasks is set. False if the tasks were not supplied by the user.
Side effects:
Sets the self.tasks attribute of FileDataset to be the list of prediction task columns
load_dataset_split_table(directory=None)[source]

Loads from the filesystem a table of compound IDs assigned to each split subset of a dataset. Called by load_presplit_dataset().

Args:
directory (str): Directory where the split table is stored. Defaults to the directory of the current dataset.
Returns:
split_df (DataFrame): Table assigning compound IDs to split subsets and folds.

split_kv: None for the FileDataset version of this method.
load_featurized_data()[source]

Loads prefeaturized data from the filesystem. Returns a data frame, which is then passed to featurization.extract_prefeaturized_data() for processing.

Returns:
featurized_dset_df (pd.DataFrame): dataframe of the prefeaturized data; needs further processing
load_full_dataset()[source]

Loads the dataset from the file system.

Returns:
dset_df: Dataset as a DataFrame loaded in from a CSV or feather file
Raises:
Exception: if the dataset is empty or failed to load
save_featurized_data(featurized_dset_df)[source]

Save a featurized dataset to the filesystem.

Args:
featurized_dset_df (pd.DataFrame): Dataset as a DataFrame that contains the featurized data
save_split_dataset(directory=None)[source]

Saves a table of compound IDs and split subset assignments for the current dataset.

Args:
directory (str): Directory where the split table will be created. Defaults to the directory of the current dataset.
class pipeline.model_datasets.MinimalDataset(params, featurization, contains_responses=False)[source]

Bases: pipeline.model_datasets.ModelDataset

A lightweight dataset class that does not support persistence or splitting, and therefore can be used for predictions with an existing model, but not for training a model. Is not expected to contain response columns, i.e. the ground truth is assumed to be unknown.

Attributes:

set in __init__:
params (Namespace object): contains all parameter information

log (logger object): logger for all warning messages.

featurization (Featurization object): The featurization object created by ModelDataset or input as an optional argument in the factory function.
set in get_featurized_data:

dataset: A new featurized DeepChem DiskDataset.

n_features: The count of features (int)

attr: A pd.DataFrame containing the compound IDs and SMILES strings
set in get_dataset_tasks:
tasks (list): list of prediction task columns
get_dataset_tasks(dset_df)[source]

Sets self.tasks to the list of prediction task columns defined for this dataset. These should be defined in the params.response_cols list that was provided when this object was created.

Args:
dset_df (pd.DataFrame): Ignored in this version.
Returns:
Success (bool): Returns true if task names are retrieved.
Side effects:
Sets the task attribute of the MinimalDataset object to a list of task names.
get_featurized_data(dset_df, is_featurized=False)[source]

Featurizes the compound data provided in data frame dset_df, and creates an associated DeepChem Dataset object.

Args:

dset_df (DataFrame): DataFrame containing either compound IDs and SMILES strings or a feature matrix

is_featurized (Boolean): boolean specifying whether the dset_df is already featurized

Returns:
None
Side effects:

Sets the following attributes in the ModelDataset object:

dataset: A new featurized DeepChem Dataset.

n_features: The count of features (int)

attr: A pd.DataFrame containing the compound IDs and SMILES strings

save_featurized_data(featurized_dset_df)[source]

Does nothing, since a MinimalDataset object does not persist its data.

Args:
featurized_dset_df (pd.DataFrame): Ignored.
class pipeline.model_datasets.ModelDataset(params, featurization)[source]

Bases: object

Base class representing a dataset for data-driven modeling. Subclasses are specialized for dealing with dataset objects persisted in the datastore or in the filesystem.

Attributes:

set in __init__:

params (Namespace object): contains all parameter information

log (logger object): logger for all warning messages.

dataset_name (str): set from the parameter object, the name of the dataset

output_dir (str): The root directory for saving output files

split_strategy (str): the flag for determining the split strategy (e.g. ‘train_test_valid’,’k-fold’)

featurization (Featurization object): The featurization object created by ModelDataset or input as an argument in the factory function.

splitting (Splitting object): A splitting object created by the ModelDataset initialization method

combined_train_valid_data (dc.DiskDataset): A dataset object (initialized as None), of the merged train and valid splits

set in get_featurized_data:

dataset: A new featurized DeepChem DiskDataset.

n_features: The count of features (int)

vals: The response col after featurization (np.array)

attr: A pd.DataFrame containing the compound IDs and SMILES strings

set in get_dataset_tasks:
tasks (list): list of prediction task columns
set in split_dataset or load_presplit_dataset:

train_valid_dsets: A list of tuples of (training,validation) DeepChem Datasets

test_dset: (dc.data.Dataset): The test dataset to be held out

train_valid_attr: A list of tuples of (training,validation) attribute DataFrames

test_attr: The attribute DataFrame for the test set, containing compound IDs and SMILES strings.

check_task_columns(dset_df)[source]

Check that the data frame dset_df contains columns for the requested prediction tasks.

Args:
dset_df (pd.DataFrame): Dataset as a DataFrame that contains columns for the prediction tasks
Raises:
Exception:
If self.get_dataset_tasks(dset_df) cannot retrieve prediction tasks for the dataset, or if prediction task columns are missing from the input dataset.
combined_training_data()[source]

Returns a DeepChem Dataset object containing data for the combined training & validation compounds.

Returns:
combined_dataset (dc.data.DiskDataset): Dataset containing the combined training and validation data.
Side effects:
Overwrites the combined_train_valid_data attribute of the ModelDataset with the combined data
create_dataset_split_table()[source]

Generates a data frame containing the information needed to reconstruct the current train/valid/test or k-fold split.

Returns:
split_df (DataFrame): Table with one row per compound in the dataset, with columns:

cmpd_id: Compound ID

subset: The subset the compound was assigned to in the split. Either ‘train’, ‘valid’, ‘test’, or ‘train_valid’. ‘train_valid’ is used for a k-fold split to indicate that the compound was rotated between training and validation sets.

fold: For a k-fold split, an integer indicating the fold in which the compound was in the validation set. Is zero for compounds in the test set and for all compounds when a train/valid/test split was used.
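For illustration, a sketch of inspecting the resulting table; my_dataset is an illustrative ModelDataset instance that has already been split:

    split_df = my_dataset.create_dataset_split_table()
    print(split_df.columns.tolist())          # ['cmpd_id', 'subset', 'fold']
    print(split_df['subset'].value_counts())  # sizes of the split subsets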
get_dataset_tasks(dset_df)[source]

Sets self.tasks to the list of prediction task columns defined for this dataset. If the dataset is in the datastore, these should be available in the metadata. Otherwise we guess by looking at the column names in dset_df and excluding features, compound IDs, SMILES string columns, etc.

Args:
dset_df (pd.DataFrame): Dataset as a DataFrame that contains columns for the prediction tasks
Returns:
success (boolean): True if self.tasks is set. False if the tasks were not supplied by the user.
Side effects:
Sets the self.tasks attribute to be the list of prediction task columns
get_featurized_data()[source]

Does whatever is necessary to prepare a featurized dataset. Loads an existing prefeaturized dataset if one exists and if parameter previously_featurized is set True; otherwise loads a raw dataset and featurizes it. Creates an associated DeepChem Dataset object.

Side effects:
Sets the following attributes in the ModelDataset object:
dataset: A new featurized DeepChem DiskDataset.

n_features: The count of features (int)

vals: The response col after featurization (np.array)

attr: A pd.DataFrame containing the compound IDs and SMILES strings
get_split_metadata()[source]

Creates a dictionary of the parameters related to dataset splitting, to be saved in the model tracker along with the other metadata needed to reproduce a model run.

Returns:
dict: A dictionary containing the data needed to reproduce the current dataset training/validation/test splits, including the lists of compound IDs for each split subset
get_subset_responses_and_weights(subset, transformers)[source]

Returns a dictionary mapping compound IDs in the given dataset subset to arrays of response values and weights. Used by the perf_data module under k-fold CV.

Args:

subset (string): Label of subset, ‘train’, ‘test’, or ‘valid’

transformers: Transformers object for full dataset

Returns:
tuple(response_dict, weight_dict):

response_dict (dict): dictionary mapping compound IDs to arrays of per-task untransformed response values

weight_dict (dict): dictionary mapping compound IDs to arrays of per-task weights
has_all_feature_columns(dset_df)[source]

Compare the columns in dataframe dset_df against the feature columns required by the current featurization and descriptor_type param. Returns True if dset_df contains all the required columns.

Args:
dset_df (DataFrame): Feature matrix
Returns:
(Boolean): True if dset_df contains all the feature columns required by the current featurization; False if any are missing
load_featurized_data()[source]

Loads prefeaturized data from the datastore or filesystem. Returns a data frame, which is then passed to featurization.extract_prefeaturized_data() for processing.

Raises:
NotImplementedError: The method is implemented by subclasses
load_full_dataset()[source]

Loads the dataset from the datastore or the file system.

Raises:
NotImplementedError: The method is implemented by subclasses
load_presplit_dataset(directory=None)[source]

Loads a table of compound IDs assigned to split subsets, and uses them to split the currently loaded featurized dataset.

Args:

directory (str): Optional directory where the split table is stored; used only by FileDataset. Defaults to the directory containing the current dataset.

Returns:

success (boolean): True if the split table was loaded successfully and used to split the dataset.

Side effects:
Sets the following attributes of the ModelDataset object

train_valid_dsets: A list of tuples of (training,validation) DeepChem Datasets

test_dset: (dc.data.Dataset): The test dataset to be held out

train_valid_attr: A list of tuples of (training,validation) attribute DataFrames

test_attr: The attribute DataFrame for the test set, containing compound IDs and SMILES strings.

Raises:
Exception: Catches exceptions from split.select_dset_by_attr_ids or from other errors while splitting the dataset using metadata
split_dataset()[source]

Splits the dataset into paired training/validation and test subsets, according to the split strategy selected by the model params. For traditional train/valid/test splits, there is only one training/validation pair. For k-fold cross-validation splits, there are k different train/valid pairs; the validation sets are disjoint but the training sets overlap.

Side effects:
Sets the following attributes in the ModelDataset object:

train_valid_dsets: A list of tuples of (training,validation) DeepChem Datasets

test_dset: (dc.data.Dataset): The test dataset to be held out

train_valid_attr: A list of tuples of (training,validation) attribute DataFrames

test_attr: The attribute DataFrame for the test set, containing compound IDs and SMILES strings.

pipeline.model_datasets.create_minimal_dataset(params, featurization, contains_responses=False)[source]

Create a MinimalDataset object for non-persistent data (e.g., a list of compounds or SMILES strings or a data frame). This object will be suitable for running predictions on a pretrained model, but not for training.

Args:

params (Namespace object): contains all parameter information.

featurization (Featurization object): The featurization object created by ModelDataset or input as an argument in the factory function.

contains_responses (Boolean): Boolean specifying whether the dataset has a column with response values

Returns:
(MinimalDataset): a new MinimalDataset object
pipeline.model_datasets.create_model_dataset(params, featurization, ds_client=None)[source]

Factory function for creating DatastoreDataset or FileDataset objects.

Args:

params (Namespace object): contains all parameter information.

featurization (Featurization object): The featurization object created by ModelDataset or input as an argument in the factory function.

ds_client (Datastore client)

Returns:
either (DatastoreDataset) or (FileDataset): instantiated ModelDataset subclass specified by params
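A hypothetical end-to-end setup for a filesystem dataset; the parameter dictionary keys and the create_featurization call follow the pipeline’s usual pattern but are assumptions here, not a verified configuration:

    from pipeline import parameter_parser, featurization as feat, model_datasets

    params = parameter_parser.wrapper({'dataset_key': 'my_dataset.csv',
                                       'featurizer': 'ecfp',
                                       'response_cols': 'pIC50'})
    featurization = feat.create_featurization(params)
    data = model_datasets.create_model_dataset(params, featurization)  # FileDataset here
    data.get_featurized_data()
    data.split_dataset()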
pipeline.model_datasets.create_split_dataset_from_metadata(model_metadata, ds_client, save_file=False)[source]

Function that pulls the split metadata from the datastore and then joins that info with the dataset itself.

Args:

model_metadata (Namespace): Namespace object of model metadata

ds_client: datastore client

save_file (Boolean): Boolean specifying whether we want to save split dataset to disk

Returns:
(DataFrame): DataFrame with subset information and response column
pipeline.model_datasets.key_value_list_to_dict(kvp_list)[source]

Convert a key-value pair list from the datastore metadata into a proper dictionary

Args:
kvp_list (list): List of key-value pairs
Returns:
(dictionary): the kvp-list reformatted as a dictionary
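A minimal sketch of the conversion, assuming the datastore’s key-value pairs are stored as a list of {'key': ..., 'value': ...} dictionaries (the layout is an assumption for this example):

    kvp_list = [{'key': 'splitter', 'value': 'scaffold'},
                {'key': 'split_valid_frac', 'value': 0.15}]
    kv_dict = {kvp['key']: kvp['value'] for kvp in kvp_list}
    # {'splitter': 'scaffold', 'split_valid_frac': 0.15}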
pipeline.model_datasets.save_joined_dataset(joined_dataset, split_metadata)[source]

DEPRECATED: Refers to absolute file paths that no longer exist.

Args:

joined_dataset (DataFrame): DataFrame containing split information with the response column

split_metadata (dictionary): Dictionary containing metadata with split info

Returns:
None
pipeline.model_datasets.set_group_permissions(system, path, data_owner='public', data_owner_group='public')[source]

Set file group and permissions to standard values for a dataset containing proprietary or public data, as indicated by ‘data_owner’.

Args:

system (string): Determines the group ownership (at the moment ‘LC’ or ‘AD’)

path (string): File path

data_owner (string): Who the data belongs to, either ‘public’ or the name of a company (e.g. ‘gsk’) associated with a restricted access group. Special values:

‘username’: group is set to the current user’s username

‘data_owner_group’: group is set to data_owner_group

Otherwise, the group is set from a hard-coded dictionary.
Returns:
None

pipeline.model_pipeline module

pipeline.model_tracker module

Module to interface model pipeline to model tracker service.

exception pipeline.model_tracker.DatastoreInsertionException[source]

Bases: Exception

exception pipeline.model_tracker.MLMTClientInstantiationException[source]

Bases: Exception

pipeline.model_tracker.convert_metadata(old_metadata)[source]

Convert model metadata from old format (with camel-case parameter group names) to new format.

Args:
old_metadata (dict): Model metadata in old format
Returns:
new_metadata (dict): Model metadata in new format
pipeline.model_tracker.export_model(model_uuid, collection, model_dir, alt_bucket='CRADA')[source]

Export the metadata (parameters) and other files needed to recreate a model from the model tracker database to a gzipped tar archive.

Args:

model_uuid (str): Model unique identifier

collection (str): Name of the collection holding the model in the database.

model_dir (str): Path to directory where the model metadata and parameter files will be written. The directory will be created if it doesn’t already exist. Subsequently, the directory contents will be packed into a gzipped tar archive named model_dir.tar.gz.

alt_bucket (str): Alternate datastore bucket to search for model tarball and transformer objects.

Returns:
None
pipeline.model_tracker.extract_datastore_model_tarball(model_uuid, model_bucket, output_dir, model_dir)[source]

Load a model tarball saved in the datastore and check the format. If it is a new style tarball (containing the model metadata and transformers along with the model state), unpack it into output_dir. Otherwise it contains the model state only; unpack it into model_dir.

Args:

model_uuid (str): UUID of model to be retrieved

model_bucket (str): Datastore bucket containing model tarball file

output_dir (str): Output directory to unpack tarball into if it’s in the new format

model_dir (str): Output directory to unpack tarball into if it’s in the old format

Returns:
extract_dir (str): The directory (output_dir or model_dir) the tarball was extracted into.
pipeline.model_tracker.get_full_metadata(filter_dict, collection_name=None)[source]

Retrieve relevant full metadata (including training run metrics) of models matching given criteria.

Args:

filter_dict (dict): dictionary to filter on

collection_name (str): Name of collection to search

Returns:
A list of matching full model metadata (including training run metrics) dictionaries. Raises MongoQueryException if the query fails.
pipeline.model_tracker.get_full_metadata_by_uuid(model_uuid, collection_name=None)[source]

Retrieve model parameter metadata for the given model_uuid and collection. The returned metadata dictionary will include training run performance metrics and training dataset metadata.

Args:

model_uuid (str): model unique identifier

collection_name(str): collection to search (optional, searches all collections if not specified)

Returns:
Matching metadata dictionary. Raises MongoQueryException if the query fails.
pipeline.model_tracker.get_metadata_by_uuid(model_uuid, collection_name=None)[source]

Retrieve model parameter metadata by model_uuid. The resulting metadata dictionary can be passed to parameter_parser.wrapper(); it does not contain performance metrics or training dataset metadata.

Args:

model_uuid (str): model unique identifier

collection_name(str): collection to search (optional, searches all collections if not specified)

Returns:
Matching metadata dictionary. Raises MongoQueryException if the query fails.
pipeline.model_tracker.get_model_collection_by_uuid(model_uuid, mlmt_client=None)[source]

Retrieve model collection given a uuid.

Args:

model_uuid (str): model uuid

mlmt_client: Ignored

Returns:
Matching collection name
Raises:
ValueError if there is no collection containing a model with the given uuid.
pipeline.model_tracker.get_model_training_data_by_uuid(uuid)[source]

Retrieve data used to train, validate, and test a model given the uuid

Args:
uuid (str): model uuid
Returns:
a tuple of dataframes containing the training, validation, and test data, including the compound ID, RDKit SMILES, and response value
pipeline.model_tracker.save_model(pipeline, collection_name='model_tracker', log=True)[source]

Save the model.

Save the model files to the datastore and save the model metadata dict to the Mongo database.

Args:

pipeline (ModelPipeline object): the pipeline to use

collection_name (str): the name of the Mongo DB collection to use

log (bool): True if logs should be printed; default True

use_personal_client (bool): True if personal client should be used (i.e. for testing), default False

Returns:
None if insertion was successful, raises DatastoreInsertionException, MLMTClientInstantiationException or MongoInsertionException otherwise
pipeline.model_tracker.save_model_tarball(output_dir, model_tarball_path)[source]

Save the model parameters, metadata and transformers as a portable gzipped tar archive.

Args:

output_dir (str): Output directory from model training

model_tarball_path (str): Path of tarball file to be created

Returns:
None

pipeline.model_wrapper module

pipeline.parameter_parser module

pipeline.perf_data module

Contains class PerfData and its subclasses, which are objects for collecting and computing model performance metrics and predictions

class pipeline.perf_data.ClassificationPerfData(model_dataset, subset)[source]

Bases: pipeline.perf_data.PerfData

Class with methods for accumulating classification model prediction data over multiple cross-validation folds and computing performance metrics after all folds have been run. Abstract class with concrete subclasses for different split strategies.

Attributes:
set in __init__

num_tasks (int): Set to None, the number of tasks

num_cmpds (int): Set to None, the number of compounds

num_classes (int): Set to None, the number of classes

accumulate_preds(predicted_vals, ids, pred_stds=None)[source]
Raises:
NotImplementedError: The method is implemented by subclasses
get_pred_values()[source]
Raises:
NotImplementedError: The method is implemented by subclasses
get_prediction_results()[source]

Returns a dictionary of performance metrics for a classification model. The dictionary values will contain only primitive Python types, so that it can be easily JSONified.

Args:
per_task (bool): True if calculating per-task metrics, False otherwise.
Returns:
pred_results (dict): dictionary of performance metrics for a classification model.
model_choice_score(score_type='roc_auc')[source]

Computes a score function based on the accumulated predicted values, to be used for selecting the best training epoch and other hyperparameters.

Args:
score_type (str): The name of the scoring metric to be used, e.g. ‘roc_auc’, ‘precision’,
‘recall’, ‘f1’; see https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter and sklearn.metrics.SCORERS.keys() for a complete list of options. Larger values of the score function indicate better models.
Returns:
score (float): A score function value. For multitask models, this will be averaged
over tasks.
class pipeline.perf_data.EpochManager(wrapper, subsets={'test': 'test', 'train': 'train', 'valid': 'valid'}, **kwargs)[source]

Bases: object

Manages lists of PerfDatas

This class manages lists of PerfDatas as well as variables related to iteratively training a model over several epochs. This class sets several variables in a given ModelWrapper for the sake of backwards compatibility.

Attributes:
Set in __init__:
_subsets (dict): Must contain the keys ‘train’, ‘valid’, ‘test’. The values
are used as subsets when calling create_perf_data.

_model_choice_score_type (str): Passed into PerfData.model_choice_score

_log (logger): The logger from wrapper.log

_should_stop (bool): True when training has satisfied the stopping conditions: either
it has reached the max number of epochs or it has exceeded early_stopping_patience

wrapper (ModelWrapper): The model wrapper where this object is being used.

_new_best_valid_score (function): This function takes no arguments and is called
whenever a new best validation score is achieved.
accumulate(ei, subset, dset)[source]

Accumulate predictions

Makes predictions, accumulates them, and calculates the performance metric. Calls PerfData.accumulate_preds belonging to the epoch, subset, and given dataset.

Args:

ei (int): Epoch index

subset (str): Which subset, should be train, valid, or test.

dset (dc.data.Dataset): Calculates the performance for the given dset

Returns:
float: Performance metric for the given dset.
compute(ei, subset)[source]

Computes performance metrics

This calls PerfData.compute_perf_metrics and saves the result in f’{subset}_epoch_perfs’

Args:

ei (int): Epoch index

subset (str): Which subset to compute_perf_metrics. Should be train, valid, or test

Returns:
None
on_new_best_valid(functional)[source]

Sets the function called when a new best validation score is achieved

Saves the function called when there’s a new best validation score.

Args:
functional (function): This function takes no arguments and returns nothing. This
function is called when there’s a new best validation score. This can be used to tell the ModelWrapper to save the model.
Returns:
None
Side effect:
Saves the _new_best_valid_score function.
set_make_pred(functional)[source]

Sets the function used to make predictions

Sets the function used to make predictions. This must be called before invoking self.update and self.accumulate

Args:
functional (function): This function takes one argument, a dc.data.Dataset, and
returns an array of predictions for that dset. This function is called when updating the training state after a given epoch.
Returns:
None
Side effects:
Saves the functional as self._make_pred
should_stop()[source]

Returns True when the training loop should stop

Returns:
bool: True when the training loop should stop
update(ei, subset, dset=None)[source]

Update training state

Updates the training state for a given subset and epoch index with the given dataset.

Args:

ei (int): Epoch index.

subset (str): Should be train, valid, test

dset (dc.data.Dataset): Updates using this dset

Returns:
perf (float): the performance of the given dset.
update_epoch(ei, train_dset=None, valid_dset=None, test_dset=None)[source]

Update training state after an epoch

This function updates train/valid/test_perf_data. Call this function once per epoch. Call self.should_stop() after calling this function to see if you should exit the training loop.

Subsets with None arguments will be ignored

Args:

ei (int): The epoch index

train_dset (dc.data.Dataset): The train dataset

valid_dset (dc.data.Dataset): The valid dataset. Providing this argument updates
best_valid_score and _should_stop

test_dset (dc.data.Dataset): The test dataset

Returns:
list: A list of performance values for the provided datasets.
Side effects
This function updates self._should_stop
update_valid(ei)[source]

Checks validation score

Checks validation performance of the given epoch index. Updates self._should_stop, checks on early stopping conditions, calls self._new_best_valid_score() when necessary.

Args:
ei (int): Epoch index
Returns:
None
Side effects
Updates self._should_stop when it’s time to exit the training loop.
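For illustration, a minimal sketch of the epoch loop these methods support; em is an EpochManager created by the ModelWrapper, and the wrapper methods, params.max_epochs, and dataset variables are illustrative assumptions:

    em.set_make_pred(lambda dset: wrapper.generate_predictions(dset))  # assumed predict fn
    em.on_new_best_valid(lambda: wrapper.save_model())                 # assumed save hook

    for ei in range(params.max_epochs):
        wrapper.train_one_epoch(train_dset)   # illustrative training step
        em.update_epoch(ei, train_dset=train_dset, valid_dset=valid_dset,
                        test_dset=test_dset)
        if em.should_stop():
            break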
class pipeline.perf_data.EpochManagerKFold(wrapper, subsets={'test': 'test', 'train': 'train', 'valid': 'valid'}, **kwargs)[source]

Bases: pipeline.perf_data.EpochManager

This class manages the training state when using KFold cross validation. This is necessary because this manager uses f’{subset}_epoch_perf_stds’ unlike EpochManager

compute(ei, subset)[source]

Calls PerfData.compute_perf_metrics()

This differs from EpochManager.compute in that it saves the results into f’{subset}_epoch_perf_stds’

Args:

ei (int): Epoch index

subset (str): Should be train, valid, test.

Returns:
None
class pipeline.perf_data.HybridPerfData(model_dataset, subset)[source]

Bases: pipeline.perf_data.PerfData

Class with methods for accumulating regression model prediction data over multiple cross-validation folds and computing performance metrics after all folds have been run. Abstract class with concrete subclasses for different split strategies.

Attributes:
set in __init__

num_tasks (int): Set to None, the number of tasks

num_cmpds (int): Set to None, the number of compounds

accumulate_preds(predicted_vals, ids, pred_stds=None)[source]
Raises:
NotImplementedError: The method is implemented by subclasses
compute_perf_metrics(per_task=False)[source]
Raises:
NotImplementedError: The method is implemented by subclasses
get_pred_values()[source]
Raises:
NotImplementedError: The method is implemented by subclasses
get_prediction_results()[source]

Returns a dictionary of performance metrics for a regression model. The dictionary values should contain only primitive Python types, so that it can be easily JSONified.

Args:
per_task (bool): True if calculating per-task metrics, False otherwise.
Returns:
pred_results (dict): dictionary of performance metrics for a regression model.
model_choice_score(score_type='r2')[source]

Computes a score function based on the accumulated predicted values, to be used for selecting the best training epoch and other hyperparameters.

Args:
score_type (str): The name of the scoring metric to be used, e.g. ‘r2’, ‘mae’, ‘rmse’
Returns:
score (float): A score function value. For multitask models, this will be averaged
over tasks.
class pipeline.perf_data.KFoldClassificationPerfData(model_dataset, transformers, subset, predict_probs=True, transformed=True)[source]

Bases: pipeline.perf_data.ClassificationPerfData

Class with methods for accumulating classification model performance data over multiple cross-validation folds and computing performance metrics after all folds have been run.

Attributes:
Set in __init__:
subset (str): Label of the type of subset of dataset for tracking predictions

num_cmpds (int): The number of compounds in the dataset

num_tasks (int): The number of tasks in the dataset

pred_vals (dict): The dictionary of prediction results

folds (int): Initialized at zero, flag for determining which k-fold is being assessed

transformers (list of Transformer objects): from input arguments

real_vals (dict): The dictionary containing the original response column values

class_names (np.array): Assumes the classes are of deepchem index type (e.g. 0,1,2,…)

num_classes (int): The number of classes to predict on
accumulate_preds(predicted_vals, ids, pred_stds=None)[source]

Add training, validation or test set predictions from the current fold to the data structure where we keep track of them.

Args:

predicted_vals (np.array): Array of the predicted values for the current dataset

ids (np.array): An np.array of compound ids for the current dataset

pred_stds (np.array): An array of the standard deviation in the predictions, not used in this method

Returns:
None
Side effects:

Overwrites the attribute pred_vals

Increments folds by 1

compute_perf_metrics(per_task=False)[source]

Computes the ROC AUC metrics for each task based on the accumulated values, averaged over training folds, along with standard deviations of the scores. If per_task is False, the scores are averaged over tasks and the overall standard deviation is reported instead.

Args:
per_task (bool): True if calculating per-task metrics, False otherwise.
Returns:

A tuple (roc_auc_mean, roc_auc_std):

roc_auc_mean: A numpy array of mean ROC AUC scores for each task, averaged over folds, if per_task is True.

Otherwise, a float giving the ROC AUC score averaged over both folds and tasks.
roc_auc_std: A numpy array of standard deviations over folds of ROC AUC values, if per_task is True.
Otherwise, a float giving the overall standard deviation.
get_pred_values()[source]

Returns the predicted values accumulated over training, with any transformations undone. If self.subset is ‘train’, ‘train_valid’ or ‘test’, the function will return the means and standard deviations of the class probabilities over the training folds for each compound, for each task. Otherwise, returns a single set of predicted probabilities for each validation set compound. For all subsets, returns the compound IDs and the most probable classes for each task.

Returns:

ids (list): list of compound IDs.

pred_classes (np.array): an (ncmpds, ntasks) array of predicted classes.

class_probs (np.array): a (ncmpds, ntasks, nclasses) array of predicted probabilities for the classes, and

prob_stds (np.array): a (ncmpds, ntasks, nclasses) array of standard errors over folds for the class probability estimates (only available for the ‘train’ and ‘test’ subsets; None otherwise).

get_real_values(ids=None)[source]

Returns the real dataset response values as an (ncmpds, ntasks, nclasses) array of indicator bits (if nclasses > 2) or an (ncmpds, ntasks) array of binary classes (if nclasses == 2), with compound IDs in the same order as in the return from get_pred_values() (unless ids is specified).

Args:
ids (list of str): Optional list of compound IDs to return values for.
Returns:
np.array of shape (ncmpds, ntasks, nclasses) of indicator bits, or a 2D (ncmpds, ntasks) array of binary classes
get_weights(ids=None)[source]

Returns the dataset response weights, as an (ncmpds, ntasks) array in the same ID order as get_pred_values() (unless ids is specified).

Args:
ids (list of str): Optional list of compound IDs to return values for.
Returns:
np.array (ncmpds, ntasks) of the real dataset response weights, in the same ID order as get_pred_values().
class pipeline.perf_data.KFoldRegressionPerfData(model_dataset, transformers, subset, transformed=True)[source]

Bases: pipeline.perf_data.RegressionPerfData

Class with methods for accumulating regression model prediction data over multiple cross-validation folds and computing performance metrics after all folds have been run.

Arguments:
Set in __init__:

subset (str): Label of the type of subset of dataset for tracking predictions

num_cmpds (int): The number of compounds in the dataset

num_tasks (int): The number of tasks in the dataset

pred_vals (dict): The dictionary of prediction results

folds (int): Initialized at zero, flag for determining which k-fold is being assessed

transformers (list of Transformer objects): from input arguments

real_vals (dict): The dictionary containing the original response column values

accumulate_preds(predicted_vals, ids, pred_stds=None)[source]

Add training, validation or test set predictions from the current fold to the data structure where we keep track of them.

Args:

predicted_vals (np.array): Array of the predicted values for the current dataset

ids (np.array): An np.array of compound ids for the current dataset

pred_stds (np.array): An array of the standard deviation in the predictions, not used in this method

Returns:
None
Raises:
ValueError: If predicted value dimensions don’t match num_tasks for RegressionPerfData
Side effects:

Overwrites the attribute pred_vals

Increments folds by 1

compute_perf_metrics(per_task=False)[source]

Computes the R-squared metrics for each task based on the accumulated values, averaged over training folds, along with standard deviations of the scores. If per_task is False, the scores are averaged over tasks and the overall standard deviation is reported instead.

Args:
per_task (bool): True if calculating per-task metrics, False otherwise.
Returns:

A tuple (r2_mean, r2_std):

r2_mean: A numpy array of mean R^2 scores for each task, averaged over folds, if per_task is True.
Otherwise, a float giving the R^2 score averaged over both folds and tasks.
r2_std: A numpy array of standard deviations over folds of R^2 values, if per_task is True.
Otherwise, a float giving the overall standard deviation.
get_pred_values()[source]

Returns the predicted values accumulated over training, with any transformations undone. If self.subset is ‘train’ or ‘test’, the function will return averages over the training folds for each compound along with standard deviations when there are predictions from multiple folds. Otherwise, returns a single predicted value for each compound.

Returns:

ids (np.array): list of compound IDs

vals (np.array): (ncmpds, ntasks) array of mean predicted values

fold_stds (np.array): (ncmpds, ntasks) array of standard deviations over folds if applicable, and None otherwise.

get_real_values(ids=None)[source]

Returns the real dataset response values, with any transformations undone, as an (ncmpds, ntasks) array in the same ID order as get_pred_values() (unless ids is specified).

Args:
ids (list of str): Optional list of compound IDs to return values for.
Returns:
np.array (ncmpds, ntasks) of the real dataset response values, with any transformations undone, in the same ID order as get_pred_values().
get_weights(ids=None)[source]

Returns the dataset response weights, as an (ncmpds, ntasks) array in the same ID order as get_pred_values() (unless ids is specified).

Args:
ids (list of str): Optional list of compound IDs to return values for.
Returns:
np.array (ncmpds, ntasks) of the real dataset response weights, in the same ID order as get_pred_values().
class pipeline.perf_data.PerfData(model_dataset, subset)[source]

Bases: object

Class with methods for accumulating prediction data over multiple cross-validation folds and computing performance metrics after all folds have been run. Abstract class with concrete subclasses for classification and regression models.

accumulate_preds(predicted_vals, ids, pred_stds=None)[source]
Raises:
NotImplementedError: The method is implemented by subclasses
compute_perf_metrics(per_task=False)[source]
Raises:
NotImplementedError: The method is implemented by subclasses
get_pred_values()[source]
Raises:
NotImplementedError: The method is implemented by subclasses
get_prediction_results()[source]
Raises:
NotImplementedError: The method is implemented by subclasses
get_real_values(ids=None)[source]
Raises:
NotImplementedError: The method is implemented by subclasses
get_weights(ids=None)[source]

Returns the dataset response weights as an (ncmpds, ntasks) array

Raises:
NotImplementedError: The method is implemented by subclasses
class pipeline.perf_data.RegressionPerfData(model_dataset, subset)[source]

Bases: pipeline.perf_data.PerfData

Class with methods for accumulating regression model prediction data over multiple cross-validation folds and computing performance metrics after all folds have been run. Abstract class with concrete subclasses for different split strategies.

Attributes:
set in __init__

num_tasks (int): Set to None, the number of tasks

num_cmpds (int): Set to None, the number of compounds

accumulate_preds(predicted_vals, ids, pred_stds=None)[source]
Raises:
NotImplementedError: The method is implemented by subclasses
compute_perf_metrics(per_task=False)[source]
Raises:
NotImplementedError: The method is implemented by subclasses
get_pred_values()[source]
Raises:
NotImplementedError: The method is implemented by subclasses
get_prediction_results()[source]

Returns a dictionary of performance metrics for a regression model. The dictionary values should contain only primitive Python types, so that it can be easily JSONified.

Args:
per_task (bool): True if calculating per-task metrics, False otherwise.
Returns:
pred_results (dict): dictionary of performance metrics for a regression model.
model_choice_score(score_type='r2')[source]

Computes a score function based on the accumulated predicted values, to be used for selecting the best training epoch and other hyperparameters.

Args:
score_type (str): The name of the scoring metric to be used, e.g. ‘r2’,
‘neg_mean_squared_error’, ‘neg_mean_absolute_error’, etc.; see https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter and sklearn.metrics.SCORERS.keys() for a complete list of options. Larger values of the score function indicate better models.
Returns:
score (float): A score function value. For multitask models, this will be averaged
over tasks.
class pipeline.perf_data.SimpleClassificationPerfData(model_dataset, transformers, subset, predict_probs=True, transformed=True)[source]

Bases: pipeline.perf_data.ClassificationPerfData

Class with methods for collecting classification model prediction and performance data from single-fold training and prediction runs.

Attributes:
Set in __init__:

subset (str): Label of the type of subset of dataset for tracking predictions

num_cmpds (int): The number of compounds in the dataset

num_tasks (int): The number of tasks in the dataset

pred_vals (dict): The dictionary of prediction results

folds (int): Initialized at zero, flag for determining which k-fold is being assessed

transformers (list of Transformer objects): from input arguments

real_vals (dict): The dictionary containing the original response column values

class_names (np.array): Assumes the classes are of deepchem index type (e.g. 0,1,2,…)

num_classes (int): The number of classes to predict on

accumulate_preds(predicted_vals, ids, pred_stds=None)[source]

Add training, validation or test set predictions from the current dataset to the data structure where we keep track of them.

Arguments:

predicted_vals (np.array): Array of predicted values (class probabilities)

ids (list): List of the compound ids of the dataset

pred_stds (np.array): Optional np.array of the prediction standard deviations

Side effects:
Updates self.pred_vals and self.perf_metrics
compute_perf_metrics(per_task=False)[source]

Returns the ROC_AUC metrics for each task based on the accumulated predictions. If per_task is False, returns the average ROC AUC over tasks.

Args:
per_task (bool): Whether to return individual ROC AUC scores for each task
Returns:
A tuple (roc_auc, std):
roc_auc: A numpy array of ROC AUC scores, if per_task is True. Otherwise,
a float giving the mean ROC AUC score over tasks.

std: Placeholder for an array of standard deviations. Always None for this class.

get_pred_values()[source]

Returns the predicted values accumulated over training, with any transformations undone. If self.subset is ‘train’, the function will average class probabilities over the k-1 folds in which each compound was part of the training set, and return the most probable class. Otherwise, there should be a single set of predicted probabilities for each validation or test set compound. Returns a tuple (ids, pred_classes, class_probs, prob_stds), where ids is the list of compound IDs, pred_classes is an (ncmpds, ntasks) array of predicted classes, class_probs is a (ncmpds, ntasks, nclasses) array of predicted probabilities for the classes, and prob_stds is a (ncmpds, ntasks, nclasses) array of standard errors for the class probability estimates.

Returns:
Tuple (ids, pred_classes, class_probs, prob_stds)

ids (list): Contains the dataset compound ids

pred_classes (np.array): Contains (ncmpds, ntasks) array of prediction classes

class_probs (np.array): Contains (ncmpds, ntasks, nclasses) array of predicted class probabilities

prob_stds (np.array): Contains (ncmpds, ntasks, nclasses) array of standard errors for the class probability estimates

get_real_values(ids=None)[source]

Returns the real dataset response values as an (ncmpds, ntasks, nclasses) array of indicator bits. If nclasses == 2, the returned array has dimension (ncmpds, ntasks).

Args:
ids: Ignored for this class
Returns:
np.array of the response values of the real dataset as indicator bits
get_weights(ids=None)[source]

Returns the dataset response weights

Args:
ids: Ignored for this class
Returns:
np.array: Containing the dataset response weights
class pipeline.perf_data.SimpleHybridPerfData(model_dataset, transformers, subset, is_ki, ki_convert_ratio=None, transformed=True)[source]

Bases: pipeline.perf_data.HybridPerfData

Class with methods for accumulating hybrid model prediction data from training, validation or test sets and computing performance metrics.

Attributes:
Set in __init__:

subset (str): Label of the type of subset of dataset for tracking predictions

num_cmpds (int): The number of compounds in the dataset

num_tasks (int): The number of tasks in the dataset

pred_vals (dict): The dictionary of prediction results

folds (int): Initialized at zero, flag for determining which k-fold is being assessed

transformers (list of Transformer objects): from input arguments

real_vals (dict): The dictionary containing the original response column values

accumulate_preds(predicted_vals, ids, pred_stds=None)[source]

Add training, validation or test set predictions to the data structure where we keep track of them.

Args:

predicted_vals (np.array): Array of predicted values

ids (list): List of the compound ids of the dataset

pred_stds (np.array): Optional np.array of the prediction standard deviations

Side effects:
Reshapes the predicted values and the standard deviations (if they are given)
compute_perf_metrics(per_task=False)[source]

Returns the R-squared metrics for each task or averaged over tasks based on the accumulated values

Args:
per_task (bool): True if calculating per-task metrics, False otherwise.
Returns:
A tuple (r2_score, std):

r2_score (np.array): An array of scores for each task, if per_task is True. Otherwise, it is a float containing the average R^2 score over tasks.

std: Always None for this class.

get_pred_values()[source]

Returns the predicted values accumulated over training, with any transformations undone. Returns a tuple (ids, values, stds), where ids is the list of compound IDs, values is a (ncmpds, ntasks) array of predictions, and stds is always None for this class.

Returns:
Tuple (ids, vals, stds)

ids (list): Contains the dataset compound ids

vals (np.array): Contains (ncmpds, ntasks) array of predictions

stds (np.array or None): Contains (ncmpds, ntasks) array of prediction standard deviations, if available; otherwise None

get_real_values(ids=None)[source]

Returns the real dataset response values, with any transformations undone, as an (ncmpds, ntasks) array with compounds in the same ID order as in the return from get_pred_values().

Args:
ids: Ignored for this class
Returns:
np.array: Containing the real dataset response values with transformations undone.
get_weights(ids=None)[source]

Returns the dataset response weights as an (ncmpds, ntasks) array

Args:
ids: Ignored for this class
Returns:
np.array: Containing the dataset response weights
class pipeline.perf_data.SimpleRegressionPerfData(model_dataset, transformers, subset, transformed=True)[source]

Bases: pipeline.perf_data.RegressionPerfData

Class with methods for accumulating regression model prediction data from training, validation or test sets and computing performance metrics.

Attributes:
Set in __init__:

subset (str): Label of the type of subset of dataset for tracking predictions

num_cmps (int): The number of compounds in the dataset

num_tasks (int): The number of tasks in the dataset

pred_vals (dict): The dictionary of prediction results

folds (int): Initialized to zero; counter for the k-fold split currently being assessed

transformers (list of Transformer objects): from input arguments

real_vals (dict): The dictionary containing the original response column values

accumulate_preds(predicted_vals, ids, pred_stds=None)[source]

Add training, validation or test set predictions to the data structure where we keep track of them.

Args:

predicted_vals (np.array): Array of predicted values

ids (list): List of the compound ids of the dataset

pred_stds (np.array): Optional np.array of the prediction standard deviations

Side effects:
Reshapes the predicted values and the standard deviations (if they are given)
compute_perf_metrics(per_task=False)[source]

Returns the R-squared metrics for each task or averaged over tasks based on the accumulated values

Args:
per_task (bool): True if calculating per-task metrics, False otherwise.
Returns:
A tuple (r2_score, std):

r2_score (np.array): An array of scores for each task, if per_task is True. Otherwise, it is a float containing the average R^2 score over tasks.

std: Always None for this class.

get_pred_values()[source]

Returns the predicted values accumulated over training, with any transformations undone. Returns a tuple (ids, vals, stds), where ids is the list of compound IDs, vals is a (ncmpds, ntasks) array of predictions, and stds is a (ncmpds, ntasks) array of prediction standard deviations, or None if no uncertainty estimates were accumulated.

Returns:
Tuple (ids, vals, stds)

ids (list): Contains the dataset compound ids

vals (np.array): Contains (ncmpds, ntasks) array of predictions

stds (np.array or None): Contains (ncmpds, ntasks) array of prediction standard deviations, if available; otherwise None

get_real_values(ids=None)[source]

Returns the real dataset response values, with any transformations undone, as an (ncmpds, ntasks) array with compounds in the same ID order as in the return from get_pred_values().

Args:
ids: Ignored for this class
Returns:
np.array: Containing the real dataset response values with transformations undone.
get_weights(ids=None)[source]

Returns the dataset response weights as an (ncmpds, ntasks) array

Args:
ids: Ignored for this class
Returns:
np.array: Containing the dataset response weights
pipeline.perf_data.create_perf_data(prediction_type, model_dataset, transformers, subset, **kwargs)[source]

Factory function that creates the right kind of PerfData object for the given subset, prediction_type (classification or regression) and split strategy (k-fold or train/valid/test).

Args:

prediction_type (str): classification or regression.

model_dataset (ModelDataset): Object representing the full dataset.

transformers (list): A list of transformer objects.

subset (str): Label in [‘train’, ‘valid’, ‘test’, ‘full’], indicating the type of subset of dataset for tracking predictions

**kwargs: Additional PerfData subclass arguments

Returns:
PerfData object
Raises:
ValueError: If split_strategy is not in [‘train_valid_test’, ‘k_fold_cv’].

ValueError: If prediction_type is not in [‘regression’, ‘classification’].
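
A hedged usage sketch; model_dataset and transformers are assumed to come from an existing pipeline run, and params.split_strategy determines which PerfData subclass is returned:

    from pipeline import perf_data

    # Track validation-set predictions for a regression model:
    pdata = perf_data.create_perf_data('regression', model_dataset,
                                       transformers, 'valid')
    # After generating predictions from a trained model:
    pdata.accumulate_preds(pred_vals, ids)
    r2, std = pdata.compute_perf_metrics(per_task=True)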
pipeline.perf_data.negative_predictive_value(y_real, y_pred)[source]

Computes negative predictive value of a binary classification model: NPV = TN/(TN+FN).

Args:

y_real (np.array): Array of ground truth values

y_pred (np.array): Array of predicted values

Returns:
(float): The negative predictive value
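
The computation is equivalent to the following sketch built on a confusion matrix (toy arrays for illustration):

    import numpy as np
    from sklearn.metrics import confusion_matrix

    y_real = np.array([0, 0, 1, 1, 0, 1])
    y_pred = np.array([0, 1, 1, 0, 0, 1])
    tn, fp, fn, tp = confusion_matrix(y_real, y_pred).ravel()
    npv = tn / (tn + fn)  # 2 / (2 + 1) = 0.667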
pipeline.perf_data.rms_error(y_real, y_pred)[source]

Calculates the root mean squared error. Score function used for model selection.

Args:

y_real (np.array): Array of ground truth values

y_pred (np.array): Array of predicted values

Returns:
(float): Root mean squared error of the input arrays
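
Equivalent to the following computation (toy arrays for illustration):

    import numpy as np

    y_real = np.array([1.0, 2.0, 3.0])
    y_pred = np.array([1.5, 2.0, 2.5])
    rmse = np.sqrt(np.mean((y_real - y_pred) ** 2))  # about 0.408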

pipeline.perf_plots module

Plotting routines for visualizing performance of regression and classification models

pipeline.perf_plots.plot_ROC_curve(MP, epoch_label='best', pdf_dir=None)[source]

Plot ROC curves for a classification model.

Args:

MP (ModelPipeline): Pipeline object for a model that was trained in the current Python session.

epoch_label (str): Label for training epoch to draw predicted values from. Currently ‘best’ is the only allowed value.

pdf_dir (str): If given, output the plots to a PDF file in the given directory.

Returns:
None
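
A hedged usage sketch, assuming MP is a ModelPipeline for a classification model trained earlier in the same session:

    from pipeline import perf_plots

    # Display ROC curves and also write them to a PDF in ./plots:
    perf_plots.plot_ROC_curve(MP, epoch_label='best', pdf_dir='./plots')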
pipeline.perf_plots.plot_perf_vs_epoch(MP, pdf_dir=None)[source]

Plot the current NN model’s standard performance metric (r2_score or roc_auc_score) vs epoch number for the training, validation and test subsets. If the model was trained with k-fold CV, plot shading for the validation set out to ±1 SD from the mean metric values, and plot the training and test set metrics from the final model retraining rather than from the cross-validation phase. Make a second plot showing the validation set model choice score, used for ranking training epochs and other hyperparameters, against epoch number.

Args:

MP (ModelPipeline): Pipeline object for a model that was trained in the current Python session.

pdf_dir (str): If given, output the plots to a PDF file in the given directory.

Returns:
None
pipeline.perf_plots.plot_prec_recall_curve(MP, epoch_label='best', pdf_dir=None)[source]

Plot precision-recall curves for a classification model.

Args:

MP (ModelPipeline): Pipeline object for a model that was trained in the current Python session.

epoch_label (str): Label for training epoch to draw predicted values from. Currently ‘best’ is the only allowed value.

pdf_dir (str): If given, output the plots to a PDF file in the given directory.

Returns:
None
pipeline.perf_plots.plot_pred_vs_actual(MP, epoch_label='best', threshold=None, error_bars=False, pdf_dir=None)[source]

Plot predicted vs actual values from a trained regression model for each split subset (train, valid, and test).

Args:

MP (ModelPipeline): Pipeline object for a model that was trained in the current Python session.

epoch_label (str): Label for training epoch to draw predicted values from. Currently ‘best’ is the only allowed value.

threshold (float): Threshold activity value to mark on plot with dashed lines.

error_bars (bool): If true and if uncertainty estimates are included in the model predictions, draw error bars
at ±1 SD from the predicted y values.

pdf_dir (str): If given, output the plots to a PDF file in the given directory.

Returns:
None
pipeline.perf_plots.plot_umap_feature_projections(MP, ndim=2, num_neighbors=20, min_dist=0.1, fit_to_train=True, dist_metric='euclidean', dist_metric_kwds={}, target_weight=0, random_seed=17, pdf_dir=None)[source]

Projects features of a model’s input dataset using UMAP to 2D or 3D coordinates and draws a scatterplot. Shape-codes plot markers to indicate whether the associated compound was in the training, validation or test set. For classification models, also uses the marker shape to indicate whether the compound’s class was correctly predicted, and uses color to indicate whether the true class was active or inactive. For regression models, uses the marker color to indicate the discrepancy between the predicted and actual values.

Args:

MP (ModelPipeline): Pipeline object for a model that was trained in the current Python session.

ndim (int): Number of dimensions (2 or 3) to project features into.

num_neighbors (int): Number of nearest neighbors used by UMAP for manifold approximation.
Larger values give a more global view of the data, while smaller values preserve more local detail.

min_dist (float): Parameter used by UMAP to set minimum distance between projected points.

fit_to_train (bool): If true (the default), fit the UMAP projection to the training set feature vectors only.
Otherwise, fit it to the entire dataset.
dist_metric (str): Name of metric to use for initial distance matrix computation. Check UMAP documentation
for supported values. The metric should be appropriate for the type of features used in the model (fingerprints or descriptors); note that jaccard is equivalent to Tanimoto distance for ECFP fingerprints.
dist_metric_kwds (dict): Additional key-value pairs used to parameterize dist_metric; see the UMAP documentation.
In particular, dist_metric_kwds[‘p’] specifies the power/exponent for the Minkowski metric.
target_weight (float): Weighting factor determining balance between activities and feature values in determining topology
of projected points. A weight of zero prioritizes the feature vectors; weight = 1 prioritizes the activity values, so that compounds with the same activity tend to be clustered together.

random_seed (int): Seed for random number generator.

pdf_dir (str): If given, output the plot to a PDF file in the given directory.

Returns:
None
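
For example, a sketch of a 2D projection using the Jaccard (Tanimoto) metric, which is appropriate for binary ECFP fingerprint features; MP is assumed to be a trained ModelPipeline:

    from pipeline import perf_plots

    perf_plots.plot_umap_feature_projections(MP, ndim=2, num_neighbors=20,
                                             dist_metric='jaccard',
                                             target_weight=0, pdf_dir=None)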
pipeline.perf_plots.plot_umap_train_set_neighbors(MP, num_neighbors=20, min_dist=0.1, dist_metric='euclidean', dist_metric_kwds={}, random_seed=17, pdf_dir=None)[source]

Projects the features of the whole dataset to 2 dimensions, without regard to response values. Plots training and validation set or training and test set compounds, color- and symbol-coded according to their actual classification and split subset. The plot does not take predicted values into account. Does not work with regression data.

Args:

MP (ModelPipeline): Pipeline object for a model that was trained in the current Python session.

num_neighbors (int): Number of nearest neighbors used by UMAP for manifold approximation.
Larger values give a more global view of the data, while smaller values preserve more local detail.

min_dist (float): Parameter used by UMAP to set minimum distance between projected points.

dist_metric (str): Name of metric to use for initial distance matrix computation. Check UMAP documentation
for supported values. The metric should be appropriate for the type of features used in the model (fingerprints or descriptors); note that jaccard is equivalent to Tanimoto distance for ECFP fingerprints.
dist_metric_kwds (dict): Additional key-value pairs used to parameterize dist_metric; see the UMAP documentation.
In particular, dist_metric_kwds[‘p’] specifies the power/exponent for the Minkowski metric.

random_seed (int): Seed for random number generator.

pdf_dir (str): If given, output the plot to a PDF file in the given directory.

pipeline.splitting module

Encapsulates everything that depends on how datasets are split: the splitting itself, training, validation, testing, generation of predicted values and performance metrics.

class pipeline.splitting.KFoldSplitting(params)[source]

Bases: pipeline.splitting.Splitting

Subclass to deal with everything related to k-fold cross-validation splits

Attributes:
Set in __init__:

params (Namespace object): contains all parameter information

split (str): Type of splitter in [‘index’,’random’,’scaffold’,’butina’,’ave_min’,’stratified’]

splitter (Deepchem split object): A splitting object of the subtype specified by split

num_folds (int): The number of k-fold splits to perform

get_split_prefix(parent='')[source]

Returns a string identifying the split strategy (TVT or k-fold) and the splitting method (index, scaffold, etc.) for use in filenames, dataset keys, etc.

Args:
parent (str): Default to empty string. Sets the parent directory for the output string
Returns:
(str): A string that identifies the split strategy and the splitting method. Appends a parent directory in front of the fold description
split_dataset(dataset, attr_df, smiles_col)[source]

Splits dataset into training, testing and validation sets.

Args:

dataset (deepchem Dataset): full featurized dataset

attr_df (Pandas DataFrame): dataframe containing SMILES strings indexed by compound IDs.

smiles_col (string): name of SMILES column (hack for now until deepchem fixes scaffold and butina splitters)

Returns:

[(train, valid)], test, [(train_attr, valid_attr)], test_attr:

train (deepchem Dataset): training dataset.

valid (deepchem Dataset): validation dataset.

test (deepchem Dataset): testing dataset.

train_attr (Pandas DataFrame): dataframe of SMILES strings indexed by compound IDs for training set.

valid_attr (Pandas DataFrame): dataframe of SMILES strings indexed by compound IDs for validation set.

test_attr (Pandas DataFrame): dataframe of SMILES strings indexed by compound IDs for test set.

Raises:
Exception: If there are duplicate IDs or SMILES strings in the dataset or in attr_df
class pipeline.splitting.Splitting(params)[source]

Bases: object

Base class for train/validation/test and k-fold dataset splitting. Wrapper for DeepChem Splitter classes that handle the specific splitting methods (e.g. random, scaffold, etc.).

Attributes:
Set in __init__:

params (Namespace object): contains all parameter information

split (str): Type of splitter in [‘index’,’random’,’scaffold’,’butina’,’ave_min’,’stratified’]

splitter (Deepchem split object): A splitting object of the subtype specified by split

get_split_prefix(parent='')[source]

Must be implemented by subclasses

Raises:
NotImplementedError: The method is implemented by subclasses
needs_smiles()[source]

Returns True if the underlying DeepChem splitter requires compound IDs to be SMILES strings

Returns:
(bool): True if Deepchem splitter requires SMILES strings as compound IDs, currently only true if using scaffold or butina splits
split_dataset(dataset, attr_df, smiles_col)[source]

Must be implemented by subclasses

Raises:
NotImplementedError: The method is implemented by subclasses
class pipeline.splitting.TrainValidTestSplitting(params)[source]

Bases: pipeline.splitting.Splitting

Subclass to deal with everything related to standard train/validation/test splits

Attributes:
Set in __init__:

params (Namespace object): contains all parameter information

split (str): Type of splitter in [‘index’,’random’,’scaffold’,’butina’,’ave_min’,’temporal’,’stratified’]

splitter (Deepchem split object): A splitting object of the subtype specified by split

num_folds (int): The number of k-fold splits to perform

get_split_prefix(parent='')[source]

Returns a string identifying the split strategy (TVT or k-fold) and the splitting method (index, scaffold, etc.) for use in filenames, dataset keys, etc.

Args:
parent (str): Default to empty string. Sets the parent directory for the output string
Returns:
(str): A string that identifies the split strategy and the splitting method. Appends a parent directory in front of the fold description
split_dataset(dataset, attr_df, smiles_col)[source]

Splits dataset into training, testing and validation sets.

For ave_min, random, scaffold, and index splits, self.params.split_valid_frac and self.params.split_test_frac must be defined, and train_frac = 1.0 - self.params.split_valid_frac - self.params.split_test_frac.

For the butina split, the test set size is not user-defined; it depends on how many clusters qualify for placement in the test set, so train_frac = 1.0 - self.params.split_valid_frac.

For the temporal split, the test set size is likewise not user-defined; it depends on the number of compounds with dates after the cutoff date, so train_frac = 1.0 - self.params.split_valid_frac. (See the sketch following this entry.)
Args:

dataset (deepchem Dataset): full featurized dataset

attr_df (Pandas DataFrame): dataframe containing SMILES strings indexed by compound IDs.

smiles_col (string): name of SMILES column (hack for now until deepchem fixes scaffold and butina splitters)

Returns:

[(train, valid)], test, [(train_attr, valid_attr)], test_attr:

train (deepchem Dataset): training dataset.

valid (deepchem Dataset): validation dataset.

test (deepchem Dataset): testing dataset.

train_attr (Pandas DataFrame): dataframe of SMILES strings indexed by compound IDs for training set.

valid_attr (Pandas DataFrame): dataframe of SMILES strings indexed by compound IDs for validation set.

test_attr (Pandas DataFrame): dataframe of SMILES strings indexed by compound IDs for test set.

Raises:
Exception: If there are duplicate IDs or SMILES strings in the dataset or in attr_df
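
As described above, the training fraction is derived from the user-specified validation and test fractions; a minimal sketch of that arithmetic, with params standing in for the pipeline parameter namespace:

    # ave_min, random, scaffold and index splits: both fractions are user-defined.
    train_frac = 1.0 - params.split_valid_frac - params.split_test_frac

    # butina and temporal splits: the test set size is data-driven.
    train_frac = 1.0 - params.split_valid_frac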
pipeline.splitting.check_if_dupe_smiles_dataset(dataset, attr_df, smiles_col)[source]

Returns a boolean: True if there are duplicates within the DeepChem dataset’s ids, or duplicates in the smiles_col column of attr_df

Args:

dataset (deepchem Dataset): full featurized dataset

attr_df (Pandas DataFrame): dataframe containing SMILES strings indexed by compound IDs.

smiles_col (string): name of SMILES column (hack for now until deepchem fixes scaffold and butina splitters)

Returns:
(bool): True if there are duplicates in the ids of the dataset or in the smiles_col of attr_df; False otherwise
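
The check can be thought of as the following sketch (an illustration, not the exact implementation):

    def has_dupes(dataset, attr_df, smiles_col):
        # Duplicate compound IDs in the DeepChem dataset?
        dupe_ids = len(set(dataset.ids)) != len(dataset.ids)
        # Duplicate SMILES strings in the attribute table?
        dupe_smiles = attr_df[smiles_col].duplicated().any()
        return dupe_ids or dupe_smiles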
pipeline.splitting.create_splitting(params)[source]

Factory function to create appropriate type of Splitting object, based on dataset parameters

Args:
params (Namespace object): contains all parameter information.
Returns:
(Splitting object): Splitting subtype (TrainValidTestSplitting or KFoldSplitting) determined by params.split_strategy
Raises:
Exception: If params.split_strategy not in [‘train_valid_test’,’k_fold_cv’]. Unsupported split strategy
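
A hedged usage sketch for a train/valid/test run; params is assumed to be a parsed pipeline parameter namespace with split_strategy set to ‘train_valid_test’:

    from pipeline import splitting

    splitter = splitting.create_splitting(params)
    [(train, valid)], test, [(train_attr, valid_attr)], test_attr = \
        splitter.split_dataset(dataset, attr_df, params.smiles_col)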
pipeline.splitting.select_attrs_by_dset_ids(dataset, attr_df)[source]

Returns a subset of the data frame attr_df selected by matching compound IDs in the index of attr_df against the ids in the dc.data.Dataset object dataset.

Args:

dataset (DiskDataset): The DeepChem dataset; its ids should match the compound IDs in the index of attr_df

attr_df (DataFrame): Data frame indexed by compound IDs, which should match the dataset ids

Returns:
subattr_df (DataFrame): A subset of attr_df as determined by the ids in dataset
pipeline.splitting.select_attrs_by_dset_smiles(dataset, attr_df, smiles_col)[source]

Returns a subset of the data frame attr_df selected by matching SMILES strings in attr_df against the ids in the dc.data.Dataset object dataset.

Args:

dataset (DiskDataset): The DeepChem dataset; its ids should match entries in attr_df

attr_df (DataFrame): Contains the compound IDs, which should match the dataset ids, and a column of SMILES strings named by smiles_col

smiles_col (str): Name of the column containing SMILES strings

Returns:
subattr_df (DataFrame): A subset of attr_df as determined by the ids in dataset. Selected by matching SMILES strings in attr_df to the ids in the dataset
pipeline.splitting.select_dset_by_attr_ids(dataset, attr_df)[source]

Returns a subset of the given dc.data.Dataset object selected by matching compound IDs in the index of attr_df against the ids in the dataset.

Args:

dataset (DiskDataset): The DeepChem dataset; its ids should match those in attr_df

attr_df (DataFrame): Contains the compound IDs used to subset the dataset; IDs should match the dataset ids

Returns:
subset (DiskDataset): A subset of the deepchem dataset as determined by the ids in attr_df
pipeline.splitting.select_dset_by_id_list(dataset, id_list)[source]

Returns a subset of the given dc.data.Dataset object selected by matching compound IDs in the given list against the ids in the dataset.

Args:

dataset (DiskDataset): The DeepChem dataset; its ids should match those in id_list

id_list (list): List of compound IDs used to subset the dataset; IDs should match the dataset ids

Returns:
subset (DiskDataset): A subset of the deepchem dataset as determined by the ids in id_list
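
Conceptually, this is equivalent to selecting row indices with DeepChem’s select method (a sketch, not the exact implementation):

    import numpy as np

    # Indices of dataset rows whose compound ID appears in id_list:
    keep = np.where(np.isin(dataset.ids, id_list))[0]
    subset = dataset.select(keep)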

pipeline.transformations module

Classes providing different methods of transforming response data and/or features in datasets, beyond those provided by DeepChem.

class pipeline.transformations.NormalizationTransformerHybrid(transform_X=False, transform_y=False, transform_w=False, dataset=None, move_mean=True)[source]

Bases: deepchem.trans.transformers.NormalizationTransformer

Test extension to check for missing data

transform(dataset, parallel=False)[source]

Transforms all internally stored data in dataset.

This method transforms all internal data in the provided dataset by using the Dataset.transform method. Note that this method adds X-transform, y-transform columns to metadata. Specified keyword arguments are passed on to Dataset.transform.

Parameters:

dataset: dc.data.Dataset
Dataset object to be transformed.
parallel: bool, optional (default False)
If True, use multiple processes to transform the dataset in parallel. For large datasets, this might be faster.
out_dir: str, optional
If out_dir is specified in kwargs and dataset is a DiskDataset, the output dataset will be written to the specified directory.

Returns:

Dataset
A newly transformed Dataset object
transform_array(X, y, w, ids)[source]

Transform the data in a set of (X, y, w) arrays.

untransform(z, isreal=True)[source]

Undo transformation on provided data.

Parameters:

z: np.ndarray
Array to transform back.

Returns:

z_out: np.ndarray
Array with normalization undone.
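
For a NormalizationTransformer fit with move_mean=True, undoing the y transformation amounts to rescaling by the stored standard deviations and adding back the means; a sketch under that assumption:

    # y_means and y_stds are the per-task statistics stored when the
    # transformer was fit; z holds transformed response values.
    z_out = z * y_stds + y_means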
class pipeline.transformations.NormalizationTransformerMissingData(transform_X=False, transform_y=False, transform_w=False, dataset=None, transform_gradients=False, move_mean=True)[source]

Bases: deepchem.trans.transformers.NormalizationTransformer

Test extension to check for missing data

transform(dataset, parallel=False)[source]

Transforms all internally stored data in dataset.

This method transforms all internal data in the provided dataset by using the Dataset.transform method. Note that this method adds X-transform, y-transform columns to metadata. Specified keyword arguments are passed on to Dataset.transform.

Parameters:

dataset: dc.data.Dataset
Dataset object to be transformed.
parallel: bool, optional (default False)
If True, use multiple processes to transform the dataset in parallel. For large datasets, this might be faster.
out_dir: str, optional
If out_dir is specified in kwargs and dataset is a DiskDataset, the output dataset will be written to the specified directory.

Returns:

Dataset
A newly transformed Dataset object
transform_array(X, y, w, ids)[source]

Transform the data in a set of (X, y, w) arrays.

class pipeline.transformations.UMAPTransformer(params, dataset)[source]

Bases: transformers.Transformer

Dimension reduction transformations using the UMAP algorithm.

Attributes:
mapper (UMAP): UMAP transformer

scaler (RobustScaler): Centering/scaling transformer
transform(dataset, parallel=False)[source]

Transforms all internally stored data in dataset.

This method transforms all internal data in the provided dataset by using the Dataset.transform method. Note that this method adds X-transform, y-transform columns to metadata. Specified keyword arguments are passed on to Dataset.transform.

Parameters:

dataset: dc.data.Dataset
Dataset object to be transformed.
parallel: bool, optional (default False)
If True, use multiple processes to transform the dataset in parallel. For large datasets, this might be faster.
out_dir: str, optional
If out_dir is specified in kwargs and dataset is a DiskDataset, the output dataset will be written to the specified directory.

Returns:

Dataset
A newly transformed Dataset object
transform_array(X, y, w, ids)[source]

Transform the data in a set of (X, y, w, ids) arrays.

Parameters:

X: np.ndarray
Array of features.
y: np.ndarray
Array of labels.
w: np.ndarray
Array of weights.
ids: np.ndarray
Array of identifiers.

Returns:

Xtrans: np.ndarray
Transformed array of features.
ytrans: np.ndarray
Transformed array of labels.
wtrans: np.ndarray
Transformed array of weights.
idstrans: np.ndarray
Transformed array of ids.
untransform(z)[source]

Reverses stored transformation on provided data.

pipeline.transformations.create_feature_transformers(params, model_dataset)[source]

Fit a scaling and centering transformation to the feature matrix of the given dataset, and return a DeepChem transformer object holding its parameters.

Args:

params (argparse.Namespace): Object containing the parameter list

model_dataset (ModelDataset): Contains the dataset to be transformed.

Returns:
(list of DeepChem transformer objects): list of transformers for the feature matrix
pipeline.transformations.create_weight_transformers(params, model_dataset)[source]

Fit an optional balancing transformation to the weight matrix of the given dataset, and return a DeepChem transformer object holding its parameters.

Args:

params (argparse.Namespace): Object containing the parameter list

model_dataset (ModelDataset): Contains the dataset to be transformed.

Returns:
(list of DeepChem transformer objects): list of transformers for the weight matrix
pipeline.transformations.get_statistics_missing_ydata(dataset)[source]

Compute and return statistics of this dataset.

This updated version gives the option to check for and ignore missing values in the y variable only. The X matrix is still assumed to have no missing values.
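
A sketch of those per-task statistics with missing y values masked out, assuming (following DeepChem conventions) that missing entries carry zero weight in dataset.w:

    import numpy as np

    # Mean and std of each response column, ignoring entries with w == 0:
    y = np.ma.masked_array(dataset.y, mask=(dataset.w == 0))
    y_means = y.mean(axis=0).filled(0.0)
    y_stds = y.std(axis=0).filled(1.0)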

pipeline.transformations.get_transformer_specific_metadata(params)[source]

Returns a dictionary of parameters related to the currently selected transformer(s).

Args:
params (argparse.Namespace): Object containing the parameter list
Returns:
meta_dict (dict): Nested dictionary of parameters and values for each currently active transformer.
pipeline.transformations.transformers_needed(params)[source]

Returns a boolean indicating whether response and/or feature transformers would be created for a model with the given parameters.

Args:
params (argparse.Namespace): Object containing the parameter list
Returns:
boolean: True if transformers are required given the model parameters.

Module contents