pipeline package
Submodules
pipeline.chem_diversity module
Functions to generate matrices or vectors of distances between compounds
- pipeline.chem_diversity.calc_dist_diskdataset(feat_type, dist_met, dataset1, dataset2=None, calc_type='nearest', num_nearest=1, **metric_kwargs)
Returns an array of distances, either between all compounds in a single dataset or between two datasets, given as DeepChem Dataset objects.
- Args:
feat_type (str): How the data was featurized. Current options are ‘ECFP’ or ‘descriptors’.
dist_met (str): What distance metric to use. Current options include tanimoto, cosine, cityblock, euclidean, or any other metric supported by scipy.spatial.distance.pdist().
dataset1 (deepchem.Dataset): Dataset containing features of compounds to be compared.
dataset2 (deepchem.Dataset, optional): Second dataset, if two datasets are to be compared.
calc_type (str): Type of summarization to perform on rows of distance matrix. See function calc_summary for options.
num_nearest (int): Additional parameter for calc_types nearest, nth_nearest and avg_n_nearest.
metric_kwargs: Additional arguments to be passed to functions that calculate metrics.
- Returns:
np.ndarray: Vector or matrix of distances between feature vectors.
- pipeline.chem_diversity.calc_dist_feat_array(feat_type, dist_met, feat1, feat2=None, calc_type='nearest', num_nearest=1, **metric_kwargs)
Returns a vector or array of distances, either between all compounds in a single dataset or between two datasets, given the feature matrices for the dataset(s).
- Args:
feat_type (str): How the data was featurized. Current options are ‘ECFP’ or ‘descriptors’.
dist_met (str): What distance metric to use. Current options include tanimoto, cosine, cityblock, euclidean, or any other metric supported by scipy.spatial.distance.pdist().
feat1: feature matrix as a numpy array
feat2: Optional, second feature matrix
calc_type (str): Type of summarization to perform on rows of distance matrix. See function calc_summary for options.
num_nearest (int): Additional parameter for calc_types nearest, nth_nearest and avg_n_nearest.
metric_kwargs: Additional arguments to be passed to functions that calculate metrics.
- Returns:
dists: vector or array of distances
- pipeline.chem_diversity.calc_dist_smiles(feat_type, dist_met, smiles_arr1, smiles_arr2=None, calc_type='nearest', num_nearest=1, **metric_kwargs)
Returns an array of distances between compounds given as SMILES strings, either between all pairs of compounds in a single dataset or between two datasets.
- Args:
feat_type (str): How the data is to be featurized, if dist_met is not ‘mcs’. The only option supported currently is ‘ECFP’.
dist_met (str): What distance metric to use. Current options include ‘tanimoto’ and ‘mcs’.
smiles_arr1 (list): First list of SMILES strings.
smiles_arr2 (list): Optional, second list of SMILES strings. Can have only 1 member if wanting compound to matrix comparison.
calc_type (str): Type of summarization to perform on rows of distance matrix. See function calc_summary for options.
num_nearest (int): Additional parameter for calc_types nearest, nth_nearest and avg_n_nearest.
metric_kwargs: Additional arguments to be passed to functions that calculate metrics.
- Returns:
dists: vector or array of distances
- Todo:
Fix the function _get_descriptors(), which is broken, and re-enable the ‘descriptors’ option for feat_type. Will need to add a parameter to indicate what kind of descriptors should be computed.
Allow other metrics for ECFP features, as in calc_dist_diskdataset().
- pipeline.chem_diversity.calc_summary(dist_arr, calc_type, num_nearest=1, within_dset=False)
Returns a summary of the distances in dist_arr, depending on calc_type.
- Args:
dist_arr (np.array): Either a 2D distance matrix, or a 1D condensed distance matrix (flattened upper triangle).
calc_type (str): The type of summary values to return:
all: The distance matrix itself
nearest: The distances to the num_nearest nearest neighbors of each compound (except compound itself)
nth_nearest: The distance to the num_nearest’th nearest neighbor
avg_n_nearest: The average of the num_nearest nearest neighbor distances
farthest: The distance to the farthest neighbor
avg: The average of all distances for each compound
num_nearest (int): Additional parameter for calc_types nearest, nth_nearest and avg_n_nearest.
within_dset (bool): True if input distances are between compounds in the same dataset.
- Returns:
dists (np.array): A numpy array of distances. For calc_type ‘nearest’ with num_nearest > 1, this is a 2D array with a row for each compound; otherwise it is a 1D array.
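The calc_type options can be illustrated on a small square distance matrix. The sketch below is a pure-Python approximation written for this page, not the AMPL implementation: the helper name summarize_rows and the toy matrix are assumptions, and the condensed (1D) input form is not handled.

```python
def summarize_rows(dist_matrix, calc_type, num_nearest=1, within_dset=False):
    """Hypothetical sketch of row-wise distance summaries (not the AMPL code).

    dist_matrix is a list of rows of a square or rectangular distance matrix.
    """
    if calc_type == "all":
        return dist_matrix
    summaries = []
    for i, row in enumerate(dist_matrix):
        # Within a single dataset, drop the distance of a compound to itself.
        dists = sorted(d for j, d in enumerate(row) if not (within_dset and j == i))
        nearest_n = dists[:num_nearest]
        if calc_type == "nearest":
            # One value per compound, or a list of values when num_nearest > 1.
            value = nearest_n if num_nearest > 1 else dists[0]
        elif calc_type == "nth_nearest":
            value = dists[num_nearest - 1]
        elif calc_type == "avg_n_nearest":
            value = sum(nearest_n) / len(nearest_n)
        elif calc_type == "farthest":
            value = dists[-1]
        elif calc_type == "avg":
            value = sum(dists) / len(dists)
        else:
            raise ValueError(f"Unknown calc_type: {calc_type}")
        summaries.append(value)
    return summaries


# Symmetric 3x3 matrix of distances within one toy dataset.
toy = [
    [0.0, 0.2, 0.8],
    [0.2, 0.0, 0.5],
    [0.8, 0.5, 0.0],
]
nearest = summarize_rows(toy, "nearest", within_dset=True)
farthest = summarize_rows(toy, "farthest", within_dset=True)
avg_two = summarize_rows(toy, "avg_n_nearest", num_nearest=2, within_dset=True)
```

With within_dset=True the zero self-distance on the diagonal is excluded, so the nearest neighbor of each compound is a different compound.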
- pipeline.chem_diversity.upload_distmatrix_to_DS(dist_matrix, feature_type, compound_ids, bucket, title, description, tags, key_values, filepath='./', dataset_key=None)
Uploads a distance matrix to the datastore with the appropriate tags.
- Args:
dist_matrix (np.ndarray): The distance matrix.
feature_type (str): How the data was featurized.
dist_met (str): What distance metric was used.
compound_ids (list): list of compound ids corresponding to the distance matrix (assumes that distance matrix is square and is the distance between all compounds in a dataset)
bucket (str): bucket the file will be put in
title (str): Title of the file, in a human-friendly format.
description (str): long text box to describe file (background/use notes)
tags (list): List of tags to assign to datastore object.
key_values (dict): Dictionary of key:value pairs to include in the datastore object’s metadata.
filepath (str): local path where you want to store the pickled dataframe
dataset_key (str): If updating a file already in the datastore, enter the corresponding dataset_key. If not, leave as None and the dataset_key will be generated automatically.
- Returns:
None
pipeline.compare_models module
Functions for comparing and visualizing model performance. Most of these functions rely on ATOM’s model tracker and datastore services, which are not part of the standard AMPL installation, but a few functions will work on collections of models saved as local files.
- pipeline.compare_models.copy_best_filesystem_models(result_dir, dest_dir, pred_type, force_update=False)
Identify the best models for each dataset within a result directory tree (e.g. from a hyperparameter search). Copy the associated model tarballs to a destination directory.
- Args:
result_dir (str): Path to model training result directory.
dest_dir (str): Path of the directory where model tarballs will be copied to.
pred_type (str): Prediction type (‘classification’ or ‘regression’) of models to copy
force_update (bool): If true, overwrite tarball files that already exist in dest_dir.
- Returns:
pd.DataFrame: Table of performance metrics for best models.
- pipeline.compare_models.del_ignored_params(dictionary, ignored_params)
Deletes ignored parameters from the dictionary if they exist
- Args:
dictionary (dict): A dictionary with parameters
ignored_params (list(str)): A list of keys potentially in the dictionary
- Returns:
None
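Because the function returns None, the dictionary is presumably modified in place. A minimal sketch of that behavior, assuming missing keys are skipped silently; drop_ignored and the toy parameter names are hypothetical stand-ins, not the AMPL code:

```python
def drop_ignored(dictionary, ignored_params):
    """Remove each ignored key from dictionary in place, skipping missing keys."""
    for key in ignored_params:
        # pop() with a default does not raise when the key is absent.
        dictionary.pop(key, None)


params = {"learning_rate": 0.001, "result_dir": "/tmp/out", "max_epochs": 100}
drop_ignored(params, ["result_dir", "not_present"])
```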
- pipeline.compare_models.extract_collection_perf_metrics(collection_name, output_dir, pred_type='regression')
Obtain list of training datasets with models in the given collection. Get performance metrics for models on each dataset and save them as CSV files in the given output directory.
- Args:
collection_name (str): Name of model tracker collection to search for models.
output_dir (str): Directory where tables of performance metrics will be written.
pred_type (str): Prediction type (‘classification’ or ‘regression’) of models to query.
- Returns:
None
- pipeline.compare_models.extract_model_and_feature_parameters(metadata_dict)
Given a model metadata dictionary, extract model and featurizer parameters. Looks for parameter names that end in _specific, e.g. nn_specific, auto_featurizer_specific.
- Args:
metadata_dict (dict): Dictionary containing NON-FLATTENED metadata for an AMPL model
- Returns:
dict: Dictionary containing featurizer and model parameters. Most contain the following keys: [‘max_epochs’, ‘best_epoch’, ‘learning_rate’, ‘layer_sizes’, ‘dropouts’, ‘rf_estimators’, ‘rf_max_features’, ‘rf_max_depth’, ‘xgb_gamma’, ‘xgb_learning_rate’, ‘xgb_max_depth’, ‘xgb_colsample_bytree’, ‘xgb_subsample’, ‘xgb_n_estimators’, ‘xgb_min_child_weight’, ‘featurizer_parameters_dict’, ‘model_parameters_dict’]
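The "_specific" lookup can be pictured as a scan over the top-level keys of a nested dictionary. The sketch below is only an illustration of that idea: collect_specific_params is a hypothetical helper, the toy metadata contents are invented, and the real function returns a richer set of keys than this.

```python
def collect_specific_params(metadata_dict):
    """Merge the contents of every top-level '*_specific' sub-dictionary."""
    collected = {}
    for key, value in metadata_dict.items():
        if key.endswith("_specific") and isinstance(value, dict):
            collected.update(value)
    return collected


# Toy, non-flattened metadata; keys and values are made up for illustration.
toy_metadata = {
    "model_uuid": "hypothetical-uuid",
    "nn_specific": {"learning_rate": 0.0005, "layer_sizes": [256, 64]},
    "auto_featurizer_specific": {"toy_featurizer_option": 3},
}
flat = collect_specific_params(toy_metadata)
```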
- pipeline.compare_models.get_best_models_info(col_names=None, bucket='public', pred_type='regression', result_dir=None, PK_pipeline=False, output_dir='/usr/local/data', shortlist_key=None, input_dset_keys=None, save_results=False, subset='valid', metric_type=None, selection_type='max', other_filters={})
Tabulate parameters and performance metrics for the best models, according to a given metric, trained against each specified dataset.
- Args:
col_names (list of str): List of model tracker collections to search.
bucket (str): Datastore bucket for training datasets.
pred_type (str): Type of models (regression or classification).
result_dir (list of str): Result directories of the models, if model tracker is not supported.
PK_pipeline (bool): Are we being called from PK pipeline?
output_dir (str): Directory to write output table to.
shortlist_key (str): Datastore key for table of datasets to query models for.
input_dset_keys (str or list of str): List of datastore keys for datasets to query models for. Either shortlist_key or input_dset_keys must be specified, but not both.
save_results (bool): If True, write the table of results to a CSV file.
subset (str): Input dataset subset (‘train’, ‘valid’, or ‘test’) for which metrics are used to select best models.
metric_type (str): Type of performance metric (r2_score, roc_auc_score, etc.) to use to select best models.
selection_type (str): Score criterion (‘max’ or ‘min’) to use to select best models.
other_filters (dict): Additional selection criteria to include in model query.
- Returns:
top_models_df (DataFrame): Table of parameters and metrics for best models for each dataset.
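The interplay of subset, metric_type and selection_type amounts to keeping, for each dataset, the model whose chosen metric is largest or smallest. A minimal sketch under that reading, using plain dictionaries instead of the model tracker; pick_best_per_dataset, the column name valid_r2_score and the toy rows are assumptions for illustration:

```python
def pick_best_per_dataset(rows, metric_key, selection_type="max"):
    """Return a dict mapping dataset_key to the row with the best metric value."""
    if selection_type not in ("max", "min"):
        raise ValueError("selection_type must be 'max' or 'min'")
    # Flip the sign so that "larger is better" holds for both criteria.
    sign = 1.0 if selection_type == "max" else -1.0
    best = {}
    for row in rows:
        current = best.get(row["dataset_key"])
        if current is None or sign * row[metric_key] > sign * current[metric_key]:
            best[row["dataset_key"]] = row
    return best


rows = [
    {"dataset_key": "dset_a.csv", "model_uuid": "m1", "valid_r2_score": 0.61},
    {"dataset_key": "dset_a.csv", "model_uuid": "m2", "valid_r2_score": 0.72},
    {"dataset_key": "dset_b.csv", "model_uuid": "m3", "valid_r2_score": 0.40},
]
best_by_r2 = pick_best_per_dataset(rows, "valid_r2_score", selection_type="max")
```

For error-type metrics, where smaller values are better, the same helper would be called with selection_type="min".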
- pipeline.compare_models.get_best_perf_table(metric_type, col_name=None, result_dir=None, model_uuid=None, metadata_dict=None, PK_pipe=False)
Extract parameters and training run performance metrics for a single model. The model may be specified either by a metadata dictionary, a model_uuid or a result directory; in the model_uuid case, the function queries the model tracker DB for the model metadata. For models saved in the filesystem, can query the performance data from the original result directory, but not from a saved tarball.
- Args:
metric_type (str): Performance metric to include in result dictionary.
col_name (str): Collection name containing model, if model is specified by model_uuid.
result_dir (str): result directory of the model, if Model tracker is not supported and metadata_dict not provided.
model_uuid (str): UUID of model to query, if metadata_dict is not provided.
metadata_dict (dict): Full metadata dictionary for a model, including training metrics and dataset metadata.
PK_pipe (bool): If True, include some additional parameters in the result dictionary specific to PK models.
- Returns:
model_info (dict): Dictionary of parameter or metric name - value pairs.
- Todo:
Add support for models saved as local tarball files.
- pipeline.compare_models.get_collection_datasets(collection_name)
Returns a list of unique training datasets used for all models in a given collection.
- Args:
collection_name (str): Name of model tracker collection to search for models.
- Returns:
list: List of model training (dataset_key, bucket) tuples.
- pipeline.compare_models.get_dataset_models(collection_names, filter_dict={})
Query the model tracker for all models saved in the model tracker DB under the given collection names. Returns a dictionary mapping (dataset_key,bucket) pairs to the list of (collection,model_uuid) pairs trained on the corresponding datasets.
- Args:
collection_names (list): List of names of model tracker collections to search for models.
filter_dict (dict): Additional filter criteria to use in model query.
- Returns:
dict: Dictionary mapping training set (dataset_key, bucket) tuples to (collection, model_uuid) pairs.
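The shape of the returned dictionary is easiest to see on toy data. The sketch below groups hypothetical model records by (dataset_key, bucket); it does not query the model tracker, and group_models_by_dataset and the record fields are illustrative assumptions.

```python
from collections import defaultdict


def group_models_by_dataset(model_records):
    """Map (dataset_key, bucket) tuples to lists of (collection, model_uuid) pairs."""
    grouped = defaultdict(list)
    for rec in model_records:
        dataset_id = (rec["dataset_key"], rec["bucket"])
        grouped[dataset_id].append((rec["collection"], rec["model_uuid"]))
    return dict(grouped)


records = [
    {"dataset_key": "dset_a.csv", "bucket": "public", "collection": "coll_1", "model_uuid": "m1"},
    {"dataset_key": "dset_a.csv", "bucket": "public", "collection": "coll_2", "model_uuid": "m2"},
    {"dataset_key": "dset_b.csv", "bucket": "public", "collection": "coll_1", "model_uuid": "m3"},
]
models_by_dataset = group_models_by_dataset(records)
```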
- pipeline.compare_models.get_filesystem_models(result_dir, pred_type)
Identify all models in result_dir and create perf_result table with ‘tarball_path’ column containing a path to each tarball.
- pipeline.compare_models.get_filesystem_perf_results(result_dir, pred_type='classification')
Retrieve metadata and performance metrics for models stored in the filesystem from a hyperparameter search run.
- Args:
result_dir (str): Root directory for results from a hyperparameter search training run.
pred_type (str): Prediction type (‘classification’ or ‘regression’) of models to query.
- Returns:
pd.DataFrame: Table of metadata fields and performance metrics.
- pipeline.compare_models.get_multitask_perf_from_files(result_dir, pred_type='regression')
Retrieve model metadata and performance metrics stored in the filesystem from a multitask hyperparameter search. Format the per-task performance metrics in a table with a row for each task and columns for each model/subset combination.
- Args:
result_dir (str): Path to root result directory containing output from a hyperparameter search run.
pred_type (str): Prediction type (‘classification’ or ‘regression’) of models to query.
- Returns:
pd.DataFrame: Table of model metadata fields and performance metrics.
- pipeline.compare_models.get_multitask_perf_from_files_new(result_dir, pred_type='regression')
Retrieve model metadata and performance metrics stored in the filesystem from a multitask hyperparameter search. Format the per-task performance metrics in a table with a row for each task and columns for each model/subset combination.
- Args:
result_dir (str): Path to root result directory containing output from a hyperparameter search run.
pred_type (str): Prediction type (‘classification’ or ‘regression’) of models to query.
- Returns:
pd.DataFrame: Table of model metadata fields and performance metrics.
- pipeline.compare_models.get_multitask_perf_from_tracker(collection_name, response_cols=None, expand_responses=None, expand_subsets='test', exhaustive=False)
Retrieve full metadata and metrics from model tracker for all models in a collection and format them into a table, including per-task performance metrics for multitask models.
Meant for multitask NN models, but works for single task models as well.
By AKP. Works for model tracker as of 10/2020
- Args:
collection_name (str): Name of model tracker collection to search for models.
response_cols (list, str or None): Names of tasks (response columns) to query performance results for. If None, checks whether the entire collection has the same response columns, and asks for clarification if it does not. Otherwise, should be a list of strings or a comma-separated string. Note: make sure response columns are listed in the same order as in the metadata. Recommended: pass None first, then clarify.
expand_responses (list, str or None): Names of tasks (response columns) to include results for in the final dataframe. Useful if you have many tasks and only want to look at the performance of a few of them. Must also be a list or comma-separated string, and must be a subset of response_cols. If None, all responses are expanded.
expand_subsets (list, str or None): Dataset subsets (‘train’, ‘valid’ and/or ‘test’) to show metrics for. Must be a list or comma-separated string, or None to expand all.
exhaustive (bool): If True, return a large dataframe with all model tracker metadata minus any columns not in expand_responses. If False, return a trimmed dataframe with the most relevant columns.
- Returns:
pd.DataFrame: Table of model metadata fields and performance metrics.
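Several arguments above accept a list, a comma-separated string, or None. One way such inputs might be normalized is sketched below; as_name_list is a hypothetical helper, not a function in this module.

```python
def as_name_list(value):
    """Return None unchanged, split a comma-separated string, or copy a list."""
    if value is None:
        return None
    if isinstance(value, str):
        # Strip whitespace around each name and ignore empty fragments.
        return [name.strip() for name in value.split(",") if name.strip()]
    return list(value)


subsets_from_string = as_name_list("train, valid")
subsets_from_list = as_name_list(["test"])
subsets_all = as_name_list(None)
```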
- pipeline.compare_models.get_summary_metadata_table(uuids, collections=None)
Tabulate metadata fields and performance metrics for a set of models identified by specific model_uuids.
- Args:
uuids (list): List of model UUIDs to query.
collections (list or str): Names of collections in model tracker DB to get models from. If collections is a string, it must identify one collection to search for all models. If a list, it must be of the same length as uuids. If not provided, all collections will be searched.
- Returns:
pd.DataFrame: Table of metadata fields and performance metrics for models.
- pipeline.compare_models.get_summary_perf_tables(collection_names=None, filter_dict={}, result_dir=None, prediction_type='regression', verbose=False)
Load model parameters and performance metrics from model tracker for all models saved in the model tracker DB under the given collection names (or result directory if Model tracker is not available) with the given prediction type. Tabulate the parameters and metrics including:
dataset (assay name, target, parameter, key, bucket)
dataset size (train/valid/test/total)
number of training folds
model type (NN or RF)
featurizer
transformation type
metrics: r2_score, mae_score and rms_score for regression, or ROC AUC for classification
- Args:
collection_names (list): Names of model tracker collections to search for models.
filter_dict (dict): Additional filter criteria to use in model query.
result_dir (str or list): Directories to search for models; must be provided if the model tracker DB is not available.
prediction_type (str): Type of models (classification or regression) to query.
verbose (bool): If true, print status messages as collections are processed.
- Returns:
pd.DataFrame: Table of model metadata fields and performance metrics.
- pipeline.compare_models.get_tarball_perf_table(model_tarball, pred_type='classification')
Retrieve model metadata and performance metrics for a model saved as a tarball (.tar.gz) file.
- Args:
model_tarball (str): Path of model tarball file, named as model.tar.gz.
pred_type (str): Prediction type (‘classification’ or ‘regression’) of model.
- Returns:
tuple (pd.DataFrame, dict): Table of performance metrics and a dictionary of model metadata.
- pipeline.compare_models.get_training_datasets(collection_names)
Query the model tracker DB for all the unique dataset keys and buckets used to train models in the given collections.
- Args:
collection_names (list): List of names of model tracker collections to search for models.
- Returns:
dict: Dictionary mapping collection names to lists of (dataset_key, bucket) tuples for training sets.
- pipeline.compare_models.get_training_perf_table(dataset_key, bucket, collection_name, pred_type='regression', other_filters={})
Load performance metrics from model tracker for all models saved in the model tracker DB under a given collection that were trained against a particular dataset. Identify training parameters that vary between models, and generate plots of performance vs particular combinations of parameters.
- Args:
dataset_key (str): Training dataset key.
bucket (str): Training dataset bucket.
collection_name (str): Name of model tracker collection to search for models.
pred_type (str): Prediction type (‘classification’ or ‘regression’) of models to query.
other_filters (dict): Other filter criteria to use in querying models.
- Returns:
pd.DataFrame: Table of models and performance metrics.
- pipeline.compare_models.num_trainable_parameters_from_file(tar_path)
Returns the number of trainable parameters in a DeepChem model saved as a tar file.
- Args:
tar_path (str): Path to a DeepChem model
- Returns:
int: Number of trainable parameters.
- Raises:
ValueError: If the model is not a DeepChem neural network model
pipeline.dist_metrics module
Distance metrics for compounds: Tanimoto and maximum common substructure (MCS)
- pipeline.dist_metrics.mcs(mols1, mols2=None)
Computes maximum common substructure (MCS) distances between pairs of molecules.
The MCS distance between molecules m1 and m2 is one minus the average of fMCS(m1,m2) and fMCS(m2,m1), where fMCS(m1,m2) is the fraction of m1’s atoms that are part of the largest common substructure of m1 and m2.
- Args:
mols1 (Sequence of rdkit.Mol): First list of molecules.
mols2 (Sequence of rdkit.Mol, optional): Second list of molecules. If not provided, computes MCS distances between pairs of molecules in mols1. Otherwise, computes a matrix of distances between pairs of molecules from mols1 and mols2.
- Returns:
np.ndarray: Matrix of pairwise distances between molecules.
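The MCS distance definition above reduces to simple arithmetic once the size of the largest common substructure is known. The sketch below assumes the atom counts are already available (in practice they could come from a substructure search such as RDKit's FindMCS); mcs_distance is a hypothetical helper, not the function documented here.

```python
def mcs_distance(n_atoms_m1, n_atoms_m2, n_atoms_common):
    """MCS distance from atom counts: 1 - mean(fMCS(m1, m2), fMCS(m2, m1))."""
    f_12 = n_atoms_common / n_atoms_m1  # fraction of m1's atoms in the MCS
    f_21 = n_atoms_common / n_atoms_m2  # fraction of m2's atoms in the MCS
    return 1.0 - (f_12 + f_21) / 2.0


# A 10-atom and a 20-atom molecule sharing an 8-atom common substructure:
# fMCS values are 0.8 and 0.4, so the distance is 1 - 0.6 = 0.4.
d_partial = mcs_distance(10, 20, 8)
# Identical molecules share all atoms, giving a distance of zero.
d_identical = mcs_distance(12, 12, 12)
```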
- pipeline.dist_metrics.tanimoto(fps1, fps2=None)
Compute Tanimoto distances between sets of ECFP fingerprints.
- Args:
fps1 (Sequence): First list of ECFP fingerprint vectors.
fps2 (Sequence, optional): Second list of ECFP fingerprint vectors. If not provided, computes distances between pairs of fingerprints in fps1. Otherwise, computes a matrix of distances between pairs of fingerprints in fps1 and fps2.
- Returns:
np.ndarray: Matrix of pairwise distances between fingerprints.
- pipeline.dist_metrics.tanimoto_single(fp, fps)
Compute a vector of Tanimoto distances between a single fingerprint and each fingerprint in a list.
- Args:
fp : Fingerprint to be compared.
fps (Sequence): List of ECFP fingerprint vectors.
- Returns:
np.ndarray: Vector of distances between fp and each fingerprint in fps.
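For binary fingerprints, the Tanimoto distance is one minus the ratio of shared on-bits to the total number of distinct on-bits. The sketch below represents each fingerprint as a Python set of on-bit indices, which is an assumption made for illustration; the functions documented here operate on ECFP fingerprint vectors, and tanimoto_distance and tanimoto_to_many are hypothetical names.

```python
def tanimoto_distance(bits_a, bits_b):
    """Tanimoto distance between two fingerprints given as sets of on-bit indices."""
    union = len(bits_a | bits_b)
    if union == 0:
        # Convention chosen for this sketch: two empty fingerprints are identical.
        return 0.0
    return 1.0 - len(bits_a & bits_b) / union


def tanimoto_to_many(fp, fps):
    """Distances between one fingerprint and each fingerprint in a list."""
    return [tanimoto_distance(fp, other) for other in fps]


query = {1, 5, 9, 12}
library = [{1, 5, 9, 12}, {1, 5, 20}, {30, 31}]
distances = tanimoto_to_many(query, library)
```

The three library entries are an exact match, a partial overlap (2 shared bits out of 5 distinct bits) and a fingerprint with no bits in common.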
pipeline.diversity_plots module
Plotting routines for visualizing chemical diversity of datasets
- pipeline.diversity_plots.diversity_plots(dset_key, datastore=True, bucket='public', title_prefix=None, ecfp_radius=4, umap_file=None, out_dir=None, id_col='compound_id', smiles_col='rdkit_smiles', is_base_smiles=False, response_col=None, max_for_mcs=300, colorpal=None)
Plot visualizations of diversity for an arbitrary table of compounds. At minimum, the file should contain columns for a compound ID and a SMILES string. Produces a clustered heatmap display of Tanimoto distances between compounds along with a 2D UMAP projection plot based on ECFP fingerprints, with points colored according to the response variable.
- Args:
dset_key (str): Datastore key or filepath for dataset.
datastore (bool): Whether to load dataset from datastore or from filesystem.
bucket (str): Name of datastore bucket containing dataset.
title_prefix (str): Prefix for plot titles.
ecfp_radius (int): Radius for ECFP fingerprint calculation.
umap_file (str, optional): Path to file to write UMAP coordinates to.
out_dir (str, optional): Output directory for plots and tables. If provided, plots will be output as PDF files rather than in the current notebook, and some additional CSV files will be generated.
id_col (str): Column in dataset containing compound IDs.
smiles_col (str): Column in dataset containing SMILES strings.
is_base_smiles (bool): True if SMILES strings do not need to be salt-stripped and standardized.
response_col (str): Column in dataset containing response values.
max_for_mcs (int): Maximum dataset size for plots based on MCS distance. If the number of compounds is less than this value, an additional cluster heatmap and UMAP projection plot will be produced based on maximum common substructure distance.
- pipeline.diversity_plots.plot_dataset_dist_distr(dataset, feat_type, dist_metric, task_name, **metric_kwargs)
Generate a density plot showing the distribution of distances between dataset feature vectors, using the specified feature type and distance metric.
- Args:
dataset (deepchem.Dataset): A dataset object. At minimum, it should contain a 2D numpy array ‘X’ of feature vectors.
feat_type (str): Type of features (‘ECFP’ or ‘descriptors’).
dist_metric (str): Name of metric to be used to compute distances; can be anything supported by scipy.spatial.distance.pdist.
task_name (str): Abbreviated name to describe dataset in plot title.
metric_kwargs: Additional arguments to pass to metric.
- Returns:
np.ndarray: Distance matrix.
- pipeline.diversity_plots.plot_tani_dist_distr(df, smiles_col, df_name, radius=2, subset_col='subset', subsets=False, ref_subset='train', plot_width=6, ndist_max=None, **metric_kwargs)
Generate a density plot showing the distribution of nearest neighbor distances between ECFP feature vectors, using the Tanimoto metric. Optionally split by subset.
- Args:
df (DataFrame): A data frame containing, at minimum, a column of SMILES strings.
smiles_col (str): Name of the column containing SMILES strings.
df_name (str): Name for the dataset, to be used in the plot title.
radius (int): Radius parameter used to calculate ECFP fingerprints. The default is 2, meaning that ECFP4 fingerprints are calculated.
subset_col (str): Name of the column containing subset names.
subsets (bool): If True, distances are only calculated for compounds not in the reference subset, and the distances computed are to the nearest neighbors in the reference subset.
ref_subset (str): Reference subset for nearest-neighbor distances, if subsets is True.
plot_width (float): Plot width in inches.
ndist_max (int): Not used, included only for backward compatibility.
metric_kwargs: Additional arguments to pass to metric. Not used, included only for backward compatibility.
- Returns:
dist (DataFrame): Table of individual nearest-neighbor Tanimoto distance values. If subsets is True, the table will include a column indicating the subset each compound belongs to.
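When subsets is True, each compound outside the reference subset is paired with its closest compound inside it. A rough pure-Python sketch of that calculation follows. It assumes fingerprints are already available as sets of on-bit indices rather than being computed from SMILES, and nearest_ref_distances and the toy records are hypothetical.

```python
def tanimoto_distance(bits_a, bits_b):
    """Tanimoto distance between two fingerprints given as sets of on-bit indices."""
    union = len(bits_a | bits_b)
    return 0.0 if union == 0 else 1.0 - len(bits_a & bits_b) / union


def nearest_ref_distances(compounds, ref_subset="train"):
    """For each compound outside ref_subset, find the distance to its nearest reference compound."""
    reference = [c["bits"] for c in compounds if c["subset"] == ref_subset]
    table = []
    for c in compounds:
        if c["subset"] == ref_subset:
            continue  # reference compounds are not scored against themselves
        nearest = min(tanimoto_distance(c["bits"], ref) for ref in reference)
        table.append({"subset": c["subset"], "dist": nearest})
    return table


compounds = [
    {"subset": "train", "bits": {1, 2, 3}},
    {"subset": "train", "bits": {4, 5, 6}},
    {"subset": "valid", "bits": {1, 2, 3, 4}},
    {"subset": "test", "bits": {7, 8}},
]
nn_table = nearest_ref_distances(compounds)
```

Here the validation compound shares three of four bits with the first training compound, while the test compound shares nothing with the training set and therefore sits at the maximum distance of 1.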
pipeline.feature_importance module
Functions to assess feature importance in AMPL models
- pipeline.feature_importance.base_feature_importance(model_pipeline=None, params=None)
Minimal baseline feature importance function. Given an AMPL model (or the parameters to train a model), returns a data frame with a row for each feature. The columns of the data frame depend on the model type and prediction type. If the model is a binary classifier, the columns include t-statistics and p-values for the differences between the means of the active and inactive compounds. If the model is a random forest, the columns will include the mean decrease in impurity (MDI) of each feature, computed by the scikit-learn feature_importances_ function. See the scikit-learn documentation for warnings about interpreting the MDI importance. For all models, the returned data frame will include feature names, means and standard deviations for each feature.
This function has been tested on RFs and NNs with rdkit descriptors. Other models and feature combinations may not be supported.
- Args:
model_pipeline (ModelPipeline): A pipeline object for a model that was trained in the current Python session or loaded from the model tracker or a tarball file. Either model_pipeline or params must be provided.
params (dict): Parameter dictionary for a model to be trained and analyzed. Either model_pipeline or a params argument must be passed; if both are passed, params is ignored and the parameters from model_pipeline are used.
- Returns:
- (imp_df, model_pipeline, pparams) (tuple):
imp_df (DataFrame): Table of feature importance metrics.
model_pipeline (ModelPipeline): Pipeline object for model that was passed to or trained by function.
pparams (Namespace): Parsed parameters of model.
- pipeline.feature_importance.cluster_permutation_importance(model_pipeline=None, params=None, score_type=None, clust_height=1, result_file=None, nreps=10, nworkers=1)
Divide the input features used in a model into correlated clusters, then assess the importance of the features by iterating over clusters, permuting the values of all the features in the cluster, and measuring the effect on the model performance metric given by score_type for the training, validation and test subsets.
- Args:
model_pipeline (ModelPipeline): A pipeline object for a model that was trained in the current Python session or loaded from the model tracker or a tarball file. Either model_pipeline or params must be provided.
params (dict): Parameter dictionary for a model to be trained and analyzed. Either model_pipeline or a params argument must be passed; if both are passed, params is ignored and the parameters from model_pipeline are used.
clust_height (float): Height at which to cut the dendrogram branches to split features into clusters.
result_file (str): Path to a CSV file where a table of features and cluster indices will be written.
nreps (int): Number of repetitions of the permutation and rescoring procedure to perform for each feature; the importance values returned will be averages over repetitions. More repetitions will yield better importance estimates at the cost of greater computing time.
nworkers (int): Number of parallel worker threads to use for permutation and rescoring. Currently ignored; multithreading will be added in a future version.
- Returns:
imp_df (DataFrame): Table of feature clusters and importance values
- pipeline.feature_importance.display_feature_clusters(model_pipeline=None, params=None, clust_height=1, corr_file=None, show_matrix=False, show_dendro=True)
Cluster the input features used in the model specified by model_pipeline or params, using Spearman correlation as a similarity metric. Display a dendrogram and/or a correlation matrix heatmap, so the user can decide the height at which to cut the dendrogram in order to split the features into clusters, for input to cluster_permutation_importance.
- Args:
model_pipeline (ModelPipeline): A pipeline object for a model that was trained in the current Python session or loaded from the model tracker or a tarball file. Either model_pipeline or params must be provided.
params (dict): Parameter dictionary for a model to be trained and analyzed. Either model_pipeline or a params argument must be passed; if both are passed, params is ignored and the parameters from model_pipeline are used.
clust_height (float): Height at which to draw a cut line in the dendrogram, to show how many clusters will be generated.
corr_file (str): Path to an optional CSV file to be created containing the feature correlation matrix.
show_matrix (bool): If True, plot a correlation matrix heatmap.
show_dendro (bool): If True, plot the dendrogram.
- Returns:
corr_linkage (np.ndarray): Linkage matrix from correlation clustering
- pipeline.feature_importance.permutation_feature_importance(model_pipeline=None, params=None, score_type=None, nreps=60, nworkers=1, result_file=None)
Assess the importance of each feature used by a trained model by permuting the values of each feature in succession in the training, validation and test sets, making predictions, computing performance metrics, and measuring the effect of scrambling each feature on a particular metric.
- Args:
model_pipeline (ModelPipeline): A pipeline object for a model that was trained in the current Python session or loaded from the model tracker or a tarball file. Either model_pipeline or params must be provided.
params (dict): Parameter dictionary for a model to be trained and analyzed. Either model_pipeline or a params argument must be passed; if both are passed, params is ignored and the parameters from model_pipeline are used.
score_type (str): Name of the scoring metric to use to assess importance. This can be any of the standard values supported by sklearn.metrics.get_scorer; the AMPL-specific values ‘npv’, ‘mcc’, ‘kappa’, ‘mae’, ‘rmse’, ‘ppv’, ‘cross_entropy’, ‘bal_accuracy’ and ‘avg_precision’ are also supported. Score types for which smaller values are better, such as ‘mae’, ‘rmse’ and ‘cross_entropy’ are mapped to their negative counterparts.
nreps (int): Number of repetitions of the permutation and rescoring procedure to perform for each feature; the importance values returned will be averages over repetitions. More repetitions will yield better importance estimates at the cost of greater computing time.
nworkers (int): Number of parallel worker threads to use for permutation and rescoring.
result_file (str): Optional path to a CSV file to which the importance table will be written.
- Returns:
imp_df (DataFrame): Table of features and importance metrics. The table will include the columns returned by base_feature_importance, along with the permutation importance scores for each feature for the training, validation and test subsets.
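The permute-and-rescore procedure described above can be sketched with the standard library alone. This is an illustrative sketch, not AMPL's implementation: `permutation_importance_sketch`, `toy_predict` and the toy data are hypothetical names, R² stands in for the chosen score_type, and only a single data subset is scored.

```python
import random
from statistics import mean


def r2_score(y_true, y_pred):
    """Coefficient of determination, used here as the example metric."""
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def permutation_importance_sketch(predict, X, y, nreps=10, seed=0):
    """Mean drop in score when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    base_score = r2_score(y, predict(X))
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(nreps):
            # Shuffle column j only, leaving the other features intact.
            col = [row[j] for row in X]
            rng.shuffle(col)
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, col)]
            drops.append(base_score - r2_score(y, predict(X_perm)))
        importances.append(mean(drops))
    return importances


def toy_predict(X):
    """Hypothetical 'trained model' that uses only feature 0."""
    return [2.0 * row[0] for row in X]


X_demo = [[float(i), float((i * 7) % 5)] for i in range(20)]
y_demo = [2.0 * row[0] for row in X_demo]
imp_demo = permutation_importance_sketch(toy_predict, X_demo, y_demo)
```

In this toy setup, scrambling feature 0 degrades the score while scrambling feature 1 leaves it unchanged, which is the signal the importance table is meant to capture.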
- pipeline.feature_importance.plot_feature_importances(imp_df, importance_col='valid_perm_importance_mean', max_feat=20, ascending=False)[source]¶
Display a horizontal bar plot showing the relative importances of the most important features or feature clusters, according to the results of permutation_feature_importance, cluster_permutation_importance or a similar function.
- Args:
imp_df (DataFrame): Table of results from permutation_feature_importance, cluster_permutation_importance, base_feature_importance or a similar function.
importance_col (str): Name of the column in imp_df to plot values from.
max_feat (int): The maximum number of features or feature clusters to plot values for.
ascending (bool): Should the features be ordered by ascending values of importance_col? Defaults to False; can be set True for p-values or something else where small values mean greater importance.
- Returns:
None
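The interaction of importance_col, max_feat and ascending can be illustrated with a small selection routine. This is a sketch under assumptions: the rows are plain dictionaries rather than a DataFrame, the `feature` key and the `top_features` helper are hypothetical, and the actual plotting step is omitted.

```python
def top_features(rows, importance_col="valid_perm_importance_mean",
                 max_feat=20, ascending=False):
    """Order rows by importance_col and keep at most max_feat of them."""
    ordered = sorted(rows, key=lambda r: r[importance_col],
                     reverse=not ascending)
    return ordered[:max_feat]


rows_demo = [
    {"feature": "f1", "valid_perm_importance_mean": 0.10},
    {"feature": "f2", "valid_perm_importance_mean": 0.45},
    {"feature": "f3", "valid_perm_importance_mean": 0.02},
]
top_demo = [r["feature"] for r in top_features(rows_demo, max_feat=2)]
```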
pipeline.hyper_perf_plots module¶
Functions for visualizing hyperparameter performance. These functions work with a dataframe of model performance metrics and hyperparameter specifications from compare_models.py. For models on the tracker, use get_multitask_perf_from_tracker(). For models in the file system, use get_filesystem_perf_results(). By Amanda P. 7/19/2022
- pipeline.hyper_perf_plots.plot_nn_perf(df, scoretype='r2_score', subset='valid')[source]¶
This function plots scatterplots of performance scores based on their NN hyperparameters.
- Args:
df (pd.DataFrame): A dataframe containing model performances from a hyperparameter search. Best practice is to use get_multitask_perf_from_tracker() or get_filesystem_perf_results().
scoretype (str): the score type you want to use. Valid options can be found in hpp.classselmets or hpp.regselmets.
subset (str): the subset of scores you’d like to plot from ‘train’, ‘valid’ and ‘test’.
- pipeline.hyper_perf_plots.plot_rf_nn_xg_perf(df, scoretype='r2_score', subset='valid')[source]¶
This function plots boxplots of performance scores based on their hyperparameters including RF, NN and XGBoost parameters as well as feature types, model types and ECFP radius.
- Args:
df (pd.DataFrame): A dataframe containing model performances from a hyperparameter search. Best practice is to use get_multitask_perf_from_tracker() or get_filesystem_perf_results().
scoretype (str): the score type you want to use. Valid options can be found in hpp.classselmets or hpp.regselmets.
subset (str): the subset of scores you’d like to plot from ‘train’, ‘valid’ and ‘test’.
- pipeline.hyper_perf_plots.plot_rf_perf(df, scoretype='r2_score', subset='valid')[source]¶
This function plots scatterplots of performance scores based on their RF hyperparameters.
- Args:
df (pd.DataFrame): A dataframe containing model performances from a hyperparameter search. Best practice is to use get_multitask_perf_from_tracker() or get_filesystem_perf_results().
scoretype (str): the score type you want to use. Valid options can be found in hpp.classselmets or hpp.regselmets.
subset (str): the subset of scores you’d like to plot from ‘train’, ‘valid’ and ‘test’.
- pipeline.hyper_perf_plots.plot_split_perf(df, scoretype='r2_score', subset='valid')[source]¶
This function plots boxplots of performance scores based on the splitter type.
- Args:
df (pd.DataFrame): A dataframe containing model performances from a hyperparameter search. Best practice is to use get_multitask_perf_from_tracker() or get_filesystem_perf_results().
scoretype (str): the score type you want to use. Valid options can be found in hpp.classselmets or hpp.regselmets.
subset (str): the subset of scores you’d like to plot from ‘train’, ‘valid’ and ‘test’.
- pipeline.hyper_perf_plots.plot_train_valid_test_scores(df, scoretype='r2_score')[source]¶
This function plots kde and line plots of performance scores based on their partitions.
- Args:
df (pd.DataFrame): A dataframe containing model performances from a hyperparameter search. Best practice is to use get_multitask_perf_from_tracker() or get_filesystem_perf_results().
scoretype (str): the score type you want to use. Valid options can be found in hpp.classselmets or hpp.regselmets.
- pipeline.hyper_perf_plots.plot_xg_perf(df, scoretype='r2_score', subset='valid')[source]¶
This function plots scatterplots of performance scores based on their XG hyperparameters.
- Args:
df (pd.DataFrame): A dataframe containing model performances from a hyperparameter search. Best practice is to use get_multitask_perf_from_tracker() or get_filesystem_perf_results().
scoretype (str): the score type you want to use. Valid options can be found in hpp.classselmets or hpp.regselmets.
subset (str): the subset of scores you’d like to plot from ‘train’, ‘valid’ and ‘test’.
pipeline.model_pipeline module¶
Contains class ModelPipeline, which loads in a dataset, splits it, trains a model, and generates predictions and output metrics for that model. Works for a variety of featurizers, splitters and other parameters on a generic dataset
- class pipeline.model_pipeline.ModelPipeline(params, ds_client=None, mlmt_client=None)[source]¶
Bases: object
Contains methods to load in a dataset, split and featurize the data, fit a model to the train dataset, generate predictions for an input dataset, and generate performance metrics for these predictions.
- Attributes:
- Set in __init__:
params (argparse.Namespace): The argparse.Namespace parameter object
log (log): The logger
run_mode (str): A flag determining the mode of the model pipeline (e.g. training or prediction)
params.dataset_name (argparse.Namespace): The dataset_name parameter of the dataset
ds_client (ac.DatastoreClient): the datastore api token to interact with the datastore
perf_dict (dict): The performance dictionary
output_dir (str): The parent path of the model directory
mlmt_client: The mlmt service client
metric_type (str): Defines the type of metric (e.g. roc_auc_score, r2_score)
- set in train_model or run_predictions:
run_mode (str): The mode to run the pipeline, set to training
featurization (Featurization object): The featurization argument or the featurization created from the input parameters
model_wrapper (ModelWrapper object): A model wrapper created from the parameters and featurization object.
- set in create_model_metadata:
model_metadata (dict): The model metadata dictionary that stores the model metrics and metadata
- Set in load_featurize_data
data (ModelDataset object): A data object that featurizes and splits the dataset
- calc_train_dset_pair_dis(metric='euclidean')[source]¶
Calculate the pairwise distance for training set compound feature vectors, needed for AD calculation.
- create_model_metadata()[source]¶
Initializes a data structure describing the current model, to be saved in the model zoo. This should include everything necessary to reproduce a model run.
- Side effects:
Sets self.model_metadata (dictionary): A dictionary of the model metadata required to recreate the model. Also contains metadata about the generating dataset.
- create_prediction_metadata(prediction_results)[source]¶
Initializes a data structure to hold performance metrics from a model run on a new dataset, to be stored in the model tracker DB. Note that this isn’t used for the training run metadata; the training_metrics section is created by the train_model() function.
- Returns:
prediction_metadata (dict): A dictionary of the metadata for a model run on a new dataset.
- get_metrics()[source]¶
Retrieve the model performance metrics from any previous training and prediction runs from the model tracker
- load_featurize_data(params=None)[source]¶
Loads the dataset from the datastore or the file system and featurizes it. If we are training a new model, split the dataset into training, validation and test sets.
The data is also split into training, validation, and test sets and saved to the filesystem or datastore.
Assumes a ModelWrapper object has already been created.
- Args:
params (Namespace): Optional set of parameters to be used for featurization; by default this function uses the parameters used when the pipeline was created.
- Side effects:
- Sets the following attributes of the ModelPipeline
- data (ModelDataset object): A data object that featurizes and splits the dataset
data.dataset (dc.DiskDataset): The transformed, featurized, and split dataset
- predict_embedding(dset_df, dset_params=None)[source]¶
Compute embeddings from a pretrained model on a set of compounds listed in a data frame. The data frame should contain, at minimum, a column of compound IDs and a column of SMILES strings.
- predict_full_dataset(dset_df, is_featurized=False, contains_responses=False, dset_params=None, AD_method=None, k=5, dist_metric='euclidean', max_train_records_for_AD=1000)[source]¶
Compute predicted responses from a pretrained model on a set of compounds listed in a data frame. The data frame should contain, at minimum, a column of compound IDs; if SMILES strings are needed to compute features, they should be provided as well. Feature columns may be provided as well. If response columns are included in the input, they will be included in the output as well to facilitate performance metric calculations.
This function is similar to predict_on_dataframe, except that it supports multitask models, and includes class probabilities in the output for classifier models.
- Args:
dset_df (DataFrame): A data frame containing compound IDs (if the compounds are to be featurized using descriptors) and/or SMILES strings (if the compounds are to be featurized using ECFP fingerprints or graph convolution) and/or precomputed features. The column names for the compound ID and SMILES columns should match id_col and smiles_col, respectively, in the model parameters.
is_featurized (bool): True if dset_df contains precomputed feature columns. If so, dset_df must contain all of the feature columns defined by the featurizer that was used when the model was trained.
contains_responses (bool): True if dataframe contains response values
dset_params (Namespace): Parameters used to interpret dataset, including id_col, smiles_col, and optionally, response_cols. If not provided, id_col, smiles_col and response_cols are assumed to be same as in the pretrained model.
AD_method (str or None): Method to use to compute applicability domain (AD) index; may be ‘z_score’, ‘local_density’ or None (the default). With the default value, AD indices will not be calculated.
k (int): Number of nearest neighbors of each training data point used to evaluate the AD index.
dist_metric (str): Metric used to compute distances between feature vectors for AD index calculation. Valid values are ‘cityblock’, ‘cosine’, ‘euclidean’, ‘jaccard’, and ‘manhattan’. If binary features such as fingerprints are used in model, ‘jaccard’ (equivalent to Tanimoto distance) may be a better choice than the other metrics which operate on continuous features.
max_train_records_for_AD (int): Maximum number of training data rows to use for AD calculation. Note that the AD calculation time scales as the square of the number of training records used. If the training dataset is larger than max_train_records_for_AD, a random sample of rows with this size is used instead for the AD calculations.
- Returns:
result_df (DataFrame): Data frame indexed by compound IDs containing a column of SMILES strings, with additional columns containing the predicted values for each response variable. If the model was trained to predict uncertainties, the returned data frame will also include standard deviation columns (named <response_col>_std) for each response variable. The result data frame may not include all the compounds in the input dataset, because the featurizer may not be able to featurize all of them.
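The effect of max_train_records_for_AD can be sketched as a random subsample taken before any pairwise distances are computed. This is a minimal sketch, not AMPL's code: `sample_training_rows` and the fixed seed are assumptions made for illustration.

```python
import random


def sample_training_rows(train_rows, max_train_records_for_AD=1000, seed=0):
    """Return all rows if few enough, otherwise a random sample of them."""
    if len(train_rows) <= max_train_records_for_AD:
        return list(train_rows)
    return random.Random(seed).sample(train_rows, max_train_records_for_AD)


rows_demo = list(range(5000))
subset_demo = sample_training_rows(rows_demo)

# Pairwise distance count grows quadratically with the number of rows kept.
n_pairs_demo = len(subset_demo) * (len(subset_demo) - 1) // 2
```

Capping the sample at 1000 rows limits the pairwise computation to 499,500 distances, compared with roughly 12.5 million for all 5000 rows.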
- predict_on_dataframe(dset_df, is_featurized=False, contains_responses=False, AD_method=None, k=5, dist_metric='euclidean')[source]¶
DEPRECATED Call predict_full_dataset instead.
- predict_on_smiles(smiles, verbose=False, AD_method=None, k=5, dist_metric='euclidean')[source]¶
Compute predicted responses from a pretrained model on a set of compounds given as a list of SMILES strings.
- Args:
smiles (list): A list containing valid SMILES strings
verbose (boolean): A switch for disabling informational messages
AD_method (str or None): Method to use to compute applicability domain (AD) index; may be ‘z_score’, ‘local_density’ or None (the default). With the default value, AD indices will not be calculated.
k (int): Number of nearest neighbors of each training data point used to evaluate the AD index.
dist_metric (str): Metric used to compute distances between feature vectors for AD index calculation. Valid values are ‘cityblock’, ‘cosine’, ‘euclidean’, ‘jaccard’, and ‘manhattan’. If binary features such as fingerprints are used in model, ‘jaccard’ (equivalent to Tanimoto distance) may be a better choice than the other metrics which operate on continuous features.
- Returns:
res (DataFrame): Data frame indexed by compound IDs containing a column of SMILES strings, with additional columns containing the predicted values for each response variable. If the model was trained to predict uncertainties, the returned data frame will also include standard deviation columns (named <response_col>_std) for each response variable. The result data frame may not include all the compounds in the input dataset, because the featurizer may not be able to featurize all of them.
- run_predictions(featurization=None)[source]¶
Instantiate a previously trained model, and use it to run predictions on a new dataset.
Generate predictions for a specified dataset, and save the predictions and performance metrics in the model results DB or in a JSON file.
- Args:
featurization (Featurization object): An optional featurization object for creating the model wrapper
- Side effects:
- Sets the following attributes of ModelPipeline:
run_mode (str): The mode to run the pipeline, set to prediction
featurization (Featurization object): The featurization argument or the featurization created from the input parameters
model_wrapper (ModelWrapper object): A model wrapper created from the parameters and featurization object.
- save_metrics(model_metrics, prefix=None, retries=5, sleep_sec=60)[source]¶
Saves the given model_metrics dictionary to a JSON file on disk, and also to the model tracker database if we’re using it.
If writing to disk, outputs to a JSON file <prefix>_model_metrics.json in the current output directory.
- Args:
model_metrics (dict or list): Either a dictionary containing the model performance metrics, or a list of dictionaries with metrics for each training label and subset.
prefix (str): An optional prefix to include in the JSON filename
retries (int): Number of retries to save to model tracker DB, if save_results is True.
sleep_sec (int): Number of seconds to sleep between retries.
- Side effects:
Saves the model_metrics dictionary to the model tracker database, or writes out a .json file
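The retries and sleep_sec arguments suggest a retry loop around the save operation. The following is a generic sketch of that pattern, not the method's actual code: `save_with_retries` and `flaky_save` are hypothetical, and treating retries as the total number of attempts is an assumption.

```python
import time


def save_with_retries(save_func, retries=5, sleep_sec=60):
    """Call save_func until it succeeds or the attempts are exhausted."""
    for attempt in range(1, retries + 1):
        try:
            return save_func()
        except Exception:
            if attempt == retries:
                raise  # Out of attempts: surface the last error.
            time.sleep(sleep_sec)


calls_demo = {"n": 0}


def flaky_save():
    """Hypothetical save operation that fails twice before succeeding."""
    calls_demo["n"] += 1
    if calls_demo["n"] < 3:
        raise ConnectionError("tracker unavailable")
    return "saved"


result_demo = save_with_retries(flaky_save, retries=5, sleep_sec=0)
```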
- save_model_metadata(retries=5, sleep_sec=60)[source]¶
Saves the data needed to reload the model in the model tracker DB or in a local tarball file.
Inserts the model metadata into the model tracker DB, if self.params.save_results is True. Otherwise, saves the model metadata to a local .json file. Generates a gzipped tar archive containing the metadata file, the transformer parameters and the model checkpoint files, and saves it in the datastore or the filesystem according to the value of save_results.
- Args:
retries (int): Number of times to retry saving to model tracker DB.
sleep_sec (int): Number of seconds to sleep between retries, if saving to model tracker DB.
- Side effects:
Saves the model metadata and parameters into the model tracker DB or a local tarball file.
- split_dataset(featurization=None)[source]¶
Load, featurize and split the dataset according to the current model parameter settings, but don’t actually train a model. Returns the split_uuid for the dataset split.
- Args:
featurization (Featurization object): An optional featurization object.
- Returns:
split_uuid (str): The unique identifier for the dataset split.
- train_model(featurization=None)[source]¶
Build model described by self.params on the training dataset described by self.params.
Generate predictions for the training, validation, and test datasets, and save the predictions and performance metrics in the model results DB or in a JSON file.
- Args:
featurization (Featurization object): An optional featurization object for creating models on a prefeaturized dataset
- Side effects:
- Sets the following attributes of the ModelPipeline object
run_mode (str): The mode to run the pipeline, set to training
featurization (Featurization object): The featurization argument or the featurization created from the input parameters
model_wrapper (ModelWrapper object): A model wrapper created from the parameters and featurization object.
model_metadata (dict): The model metadata dictionary that stores the model metrics and metadata
- pipeline.model_pipeline.build_dataset_name(dataset_key)[source]¶
Return the dataset_name when given a dataset_key. Assumes that the dataset_name is a path and ends with an extension
- Args:
dataset_key (str): A dataset_key
- Returns:
The dataset_name which is the base name stripped of extensions
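A base name stripped of its extension can be derived with `os.path`. This is a sketch of the idea only: `dataset_name_sketch` is a hypothetical helper, and how the real function treats keys with multiple extensions (for example `.csv.gz`) is not stated here, so this version removes only the final one.

```python
import os


def dataset_name_sketch(dataset_key):
    """Drop the directory part and the final extension of a dataset key."""
    return os.path.splitext(os.path.basename(dataset_key))[0]


name_demo = dataset_name_sketch("/data/delaney-processed.csv")
```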
- pipeline.model_pipeline.build_tarball_name(dataset_name, model_uuid, result_dir='')[source]¶
- format for building model tarball names
Creates the file name for a model tarball from dataset key and model_uuid with optional result_dir.
- Args:
dataset_name (str): The dataset_name used to train this model
model_uuid (str): The model_uuid assigned to this model
result_dir (str): Optional directory for this model
- Returns:
The path or filename of the tarball for this model
- pipeline.model_pipeline.calc_AD_kmean_dist(train_dset, pred_dset, k, train_dset_pair_distance=None, dist_metric='euclidean')[source]¶
Calculate the probability that the prediction dataset falls within the domain of the training set, using the Euclidean distance to the k nearest neighbors. train_dset and pred_dset should be 2D numpy arrays in which each row is a compound.
- pipeline.model_pipeline.calc_AD_kmean_local_density(train_dset, pred_dset, k, train_dset_pair_distance=None, dist_metric='euclidean')[source]¶
Evaluate the AD of the prediction data by comparing the distance between each unseen object and its k nearest neighbors in the training set to the distance between these k nearest neighbors and their own k nearest neighbors in the training set. Returns the distance ratio; a value greater than 1 means the prediction data point is far from the domain.
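The distance ratio described above can be sketched in pure Python. This is an illustration of the idea rather than the function's actual code: `knn` and `local_density_ratio` are hypothetical helpers, Euclidean distance is hard-coded, and averaging the k neighbor distances is an assumption about how they are summarized.

```python
import math
from statistics import mean


def knn(point, pool, k, skip_index=None):
    """Return (distance, index) pairs for the k rows of pool nearest to point."""
    dists = sorted(
        (math.dist(point, row), i)
        for i, row in enumerate(pool)
        if i != skip_index
    )
    return dists[:k]


def local_density_ratio(train, query, k):
    """Mean query-to-neighbor distance over mean neighbor-to-neighbor distance."""
    neighbors = knn(query, train, k)
    query_dist = mean(d for d, _ in neighbors)
    # For each neighbor, look up its own k nearest training points.
    neighbor_dist = mean(
        mean(d for d, _ in knn(train[i], train, k, skip_index=i))
        for _, i in neighbors
    )
    return query_dist / neighbor_dist


train_demo = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
near_ratio = local_density_ratio(train_demo, [0.5, 0.5], k=2)
far_ratio = local_density_ratio(train_demo, [10.0, 10.0], k=2)
```

A query point inside the training cloud gives a ratio below 1, while a distant point gives a ratio well above 1.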
- pipeline.model_pipeline.create_prediction_pipeline(params, model_uuid, collection_name=None, featurization=None, alt_bucket='CRADA')[source]¶
Create a ModelPipeline object to be used for running blind predictions on datasets where the ground truth is not known, given a pretrained model in the model tracker database.
- Args:
params (Namespace or dict): A parsed parameters namespace, containing parameters describing how input datasets should be processed. If a dictionary is passed, it will be parsed to fill in default values and convert it to a Namespace object.
model_uuid (str): The UUID of a trained model.
collection_name (str): The collection where the model is stored in the model tracker DB.
featurization (Featurization): An optional featurization object to be used for featurizing the input data. If none is provided, one will be created based on the stored model parameters.
alt_bucket (str): Alternative bucket to search for model tarball and transformer files, if original bucket no longer exists.
- Returns:
pipeline (ModelPipeline): A pipeline object to be used for making predictions.
- pipeline.model_pipeline.create_prediction_pipeline_from_file(params, reload_dir, model_path=None, model_type='best_model', featurization=None, verbose=True)[source]¶
Create a ModelPipeline object to be used for running blind predictions on datasets, given a pretrained model stored in the filesystem. The model may be stored either as a gzipped tar archive or as a directory.
- Args:
params (Namespace): A parsed parameters namespace, containing parameters describing how input datasets should be processed.
reload_dir (str): The path to the parent directory containing the various model subdirectories (e.g.: ‘/home/cdsw/model/delaney-processed/delaney-processed/pxc50_NN_graphconv_scaffold_regression/’). If reload_dir is None, then model_path must be specified. If both are specified, then the tar archive given by model_path will be unpacked into reload_dir, possibly overwriting existing files in that directory.
model_path (str): Path to a gzipped tar archive containing the saved model metadata and parameters. If specified, the tar archive is unpacked into reload_dir if that directory is given, or to a temporary directory otherwise.
model_type (str): Name of the subdirectory in reload_dir or in the tar archive where the trained model state parameters should be loaded from.
featurization (Featurization): An optional featurization object to be used for featurizing the input data. If none is provided, one will be created based on the stored model parameters.
- Returns:
pipeline (ModelPipeline): A pipeline object to be used for making predictions.
- pipeline.model_pipeline.ensemble_predict(model_uuids, collections, dset_df, labels=None, dset_params=None, splitters=None, mt_client=None, aggregate='mean', contains_responses=False)[source]¶
Load a series of pretrained models and predict responses with each model; then aggregate the predicted responses into one prediction per compound.
- Args:
model_uuids (iterable of str): Sequence of UUIDs of trained models.
collections (str or iterable of str): The collection(s) where the models are stored in the model tracker DB. If a single string, the same collection is assumed to contain all the models. Otherwise, collections should be of the same length as model_uuids.
dset_df (DataFrame): Dataset to perform predictions on. Should contain compound IDs and SMILES strings. May contain features.
labels (iterable of str): Optional suffixes for model-specific prediction column names. If not provided, the columns are labeled ‘pred_<uuid>’ where <uuid> is the model UUID.
dset_params (Namespace): Parameters used to interpret dataset, including id_col and smiles_col. If not provided, id_col and smiles_col are assumed to be same as in the pretrained model and the same for all models.
mt_client: Ignored, for backward compatibility only.
aggregate (str): Method to be used to combine predictions.
- Returns:
pred_df (DataFrame): Table with predicted responses from each model, plus the ensemble prediction.
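The aggregation step can be sketched as follows. This is a simplified illustration, not the function's implementation: predictions are held in plain dictionaries instead of a DataFrame, `aggregate_predictions` is a hypothetical helper, the sketch assumes every model returned a prediction for every compound, and 'median' is included only as an example of an alternative to the default 'mean'.

```python
from statistics import mean, median

# Illustrative aggregation methods; only 'mean' is confirmed as the default.
AGGREGATORS = {"mean": mean, "median": median}


def aggregate_predictions(per_model_preds, aggregate="mean"):
    """Combine per-model predictions into one value per compound ID."""
    agg = AGGREGATORS[aggregate]
    compound_ids = next(iter(per_model_preds.values())).keys()
    return {
        cid: agg(preds[cid] for preds in per_model_preds.values())
        for cid in compound_ids
    }


# Keys follow the 'pred_<uuid>' naming; the UUIDs here are placeholders.
preds_demo = {
    "pred_a": {"c1": 1.0, "c2": 4.0},
    "pred_b": {"c1": 3.0, "c2": 6.0},
}
ensemble_demo = aggregate_predictions(preds_demo)
```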
- pipeline.model_pipeline.load_from_tracker(model_uuid, collection_name=None, client=None, verbose=False, alt_bucket='CRADA')[source]¶
DEPRECATED. Use the function create_prediction_pipeline() directly, or use the higher-level function predict_from_model.predict_from_tracker_model().
Create a ModelPipeline object using the metadata in the model tracker.
- Args:
model_uuid (str): The UUID of a trained model.
collection_name (str): The collection where the model is stored in the model tracker DB.
client : Ignored, for backward compatibility only
verbose (bool): A switch for disabling informational messages
alt_bucket (str): Alternative bucket to search for model tarball and transformer files, if original bucket no longer exists.
- Returns:
- tuple of:
pipeline (ModelPipeline): A pipeline object to be used for making predictions.
pparams (Namespace): Parsed parameter namespace from the requested model.
- pipeline.model_pipeline.regenerate_results(result_dir, params=None, metadata_dict=None, shared_featurization=None, system='twintron-blue')[source]¶
Query the model tracker for models matching the criteria in params.model_filter. Run predictions with each model using the dataset specified by the remaining parameters.
- Args:
result_dir (str): Parent of directory where result files will be written
params (Namespace): Parsed parameters
metadata_dict (dict): Model metadata
shared_featurization (Featurization): Object to map compounds to features, shared across models. User is responsible for ensuring that shared_featurization is compatible with all matching models.
system (str): System name
- Returns:
result_dict (dict): Results from predictions
- pipeline.model_pipeline.retrain_model(model_uuid, collection_name=None, result_dir=None, mt_client=None, verbose=True)[source]¶
Obtain model parameters from the metadata in the model tracker, given the model_uuid, and train a new model using exactly the same parameters (except for result_dir). Returns the resulting ModelPipeline object. The pipeline object can then be used as input for performance plots and other analyses that can’t be done using just the metrics stored in the model tracker; or to make predictions on new data.
- Args:
model_uuid (str): The UUID of a trained model.
collection_name (str): The collection where the model is stored in the model tracker DB.
result_dir (str): The directory of model results when the model tracker is not available.
mt_client : Ignored
verbose (bool): A switch for disabling informational messages
- Returns:
pipeline (ModelPipeline): A pipeline object containing data from the model training.
- pipeline.model_pipeline.run_models(params, shared_featurization=None, generator=False)[source]¶
Query the model tracker for models matching the criteria in params.model_filter. Run predictions with each model using the dataset specified by the remaining parameters.
- Args:
params (Namespace): Parsed parameters
shared_featurization (Featurization): Object to map compounds to features, shared across models. User is responsible for ensuring that shared_featurization is compatible with all matching models.
generator (bool): True if run as a generator
pipeline.model_tracker module¶
Module to interface model pipeline to model tracker service.
- pipeline.model_tracker.convert_metadata(old_metadata)[source]¶
Convert model metadata from old format (with camel-case parameter group names) to new format.
- Args:
old_metadata (dict): Model metadata in old format
- Returns:
new_metadata (dict): Model metadata in new format
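Renaming camel-case group names can be sketched with a regular expression. This illustrates only the naming conversion and is an assumption about the target format: `camel_to_snake`, `rename_groups` and the group names in the example are hypothetical, and the real conversion may also restructure the metadata.

```python
import re


def camel_to_snake(name):
    """Insert underscores before interior capitals, then lowercase."""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


def rename_groups(old_metadata):
    """Rename top-level parameter groups; nested values are left untouched."""
    return {camel_to_snake(key): value for key, value in old_metadata.items()}


old_demo = {
    "ModelParameters": {"model_type": "NN"},
    "TrainingDataset": {"dataset_key": "x.csv"},
}
new_demo = rename_groups(old_demo)
```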
- pipeline.model_tracker.export_model(model_uuid, collection, model_dir, alt_bucket='CRADA')[source]¶
Export the metadata (parameters) and other files needed to recreate a model from the model tracker database to a gzipped tar archive.
- Args:
model_uuid (str): Model unique identifier
collection (str): Name of the collection holding the model in the database.
model_dir (str): Path to directory where the model metadata and parameter files will be written. The directory will be created if it doesn’t already exist. Subsequently, the directory contents will be packed into a gzipped tar archive named model_dir.tar.gz.
alt_bucket (str): Alternate datastore bucket to search for model tarball and transformer objects.
- Returns:
None
- pipeline.model_tracker.extract_datastore_model_tarball(model_uuid, model_bucket, output_dir, model_dir)[source]¶
Load a model tarball saved in the datastore and check the format. If it is a new style tarball (containing the model metadata and transformers along with the model state), unpack it into output_dir. Otherwise it contains the model state only; unpack it into model_dir.
- Args:
model_uuid (str): UUID of model to be retrieved
model_bucket (str): Datastore bucket containing model tarball file
output_dir (str): Output directory to unpack tarball into if it’s in the new format
model_dir (str): Output directory to unpack tarball into if it’s in the old format
- Returns:
extract_dir (str): The directory (output_dir or model_dir) the tarball was extracted into.
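The format check described above amounts to inspecting the archive's member names before unpacking. The sketch below works on in-memory archives and is not the function's actual code: the metadata file name `model_metadata.json`, `make_tarball` and `choose_extract_dir` are assumptions introduced for illustration.

```python
import io
import os
import tarfile

METADATA_NAME = "model_metadata.json"  # Assumed marker of a new-style tarball.


def make_tarball(file_names):
    """Build a small gzipped tar archive in memory for demonstration."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name in file_names:
            payload = b"{}"
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def choose_extract_dir(tar_bytes, output_dir, model_dir):
    """Pick output_dir for new-style archives, model_dir otherwise."""
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:gz") as tar:
        base_names = {os.path.basename(n) for n in tar.getnames()}
    return output_dir if METADATA_NAME in base_names else model_dir


new_style = make_tarball(["model_metadata.json", "best_model/checkpoint"])
old_style = make_tarball(["checkpoint"])
```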
- pipeline.model_tracker.get_full_metadata(filter_dict, collection_name=None)[source]¶
Retrieve relevant full metadata (including training run metrics) of models matching given criteria.
- Args:
filter_dict (dict): dictionary to filter on
collection_name (str): Name of collection to search
- Returns:
A list of matching full model metadata (including training run metrics) dictionaries. Raises MongoQueryException if the query fails.
- pipeline.model_tracker.get_full_metadata_by_uuid(model_uuid, collection_name=None)[source]¶
Retrieve model parameter metadata for the given model_uuid and collection. The returned metadata dictionary will include training run performance metrics and training dataset metadata.
- Args:
model_uuid (str): model unique identifier
collection_name (str): collection to search (optional, searches all collections if not specified)
- Returns:
Matching metadata dictionary. Raises MongoQueryException if the query fails.
- pipeline.model_tracker.get_metadata_by_uuid(model_uuid, collection_name=None)[source]¶
Retrieve model parameter metadata by model_uuid. The resulting metadata dictionary can be passed to parameter_parser.wrapper(); it does not contain performance metrics or training dataset metadata.
- Args:
model_uuid (str): model unique identifier
collection_name (str): collection to search (optional, searches all collections if not specified)
- Returns:
Matching metadata dictionary. Raises MongoQueryException if the query fails.
- pipeline.model_tracker.get_model_collection_by_uuid(model_uuid, mlmt_client=None)[source]¶
Retrieve model collection given a uuid.
- Args:
model_uuid (str): model uuid
mlmt_client: Ignored
- Returns:
Matching collection name
- Raises:
ValueError if there is no collection containing a model with the given uuid.
- pipeline.model_tracker.get_model_training_data_by_uuid(uuid)[source]¶
Retrieve data used to train, validate, and test a model given the uuid
- Args:
uuid (str): model uuid
- Returns:
A tuple of dataframes containing the training data, validation data, and test data, including the compound ID, RDKit SMILES, and response value.
- pipeline.model_tracker.save_model(pipeline, collection_name='model_tracker', log=True)[source]¶
Save the model.
Save the model files to the datastore and save the model metadata dict to the Mongo database.
- Args:
pipeline (ModelPipeline object): the pipeline to use
collection_name (str): the name of the Mongo DB collection to use
log (bool): True if logs should be printed, default True
use_personal_client (bool): True if personal client should be used (i.e. for testing), default False
- Returns:
None if insertion was successful, raises DatastoreInsertionException, MLMTClientInstantiationException or MongoInsertionException otherwise
- pipeline.model_tracker.save_model_tarball(output_dir, model_tarball_path)[source]¶
Save the model parameters, metadata and transformers as a portable gzipped tar archive.
- Args:
output_dir (str): Output directory from model training
model_tarball_path (str): Path of tarball file to be created
- Returns:
None
pipeline.parameter_parser module¶
- class pipeline.parameter_parser.AutoArgumentAdder(func, prefix)[source]¶
Bases: object
Finds, manages, and adds all parameters of an object to an argparse parser
AutoArgumentAdder recursively finds all keyword arguments of a given object. A prefix is added to each keyword argument to prevent collisions and help distinguish automatically added arguments from normal arguments.
- Attributes:
func (object): The original object, e.g. dcm.AttentiveFPModel
funcs (List[object]): A list of parents, e.g. KerasModel
prefix (str): A prefix for arguments, e.g. ‘AttentiveFPModel’
types (dict): A mapping between parameter names and types. Prefixes are not used in the keys.
used_by (dict): A mapping between parameter names (no prefix) and the object or objects that use that parameter.
args (set): A set of all argument names
- add_to_parser(parser)[source]¶
Adds expected parameters to an argparse.ArgumentParser. Checks to see if the argument has synonyms e.g. mode and prediction_type and sets dest accordingly. All parameters have default=None, this is checked later in self.extract_params. None parameters are not passed on so we can use default parameters set by DeepChem.
- Args:
parser (argparse.ArgumentParser): An argument parser
- Returns:
None
- all_prefixed_names()[source]¶
Returns a list of all argument names with prefixes added
- Args:
None
- Returns:
List[str]: A list of all arguments with prefix added
- extract_params(params, strip_prefix=False)[source]¶
Extracts non-None parameters from the given Namespace.
- Args:
params (Namespace): Parameters.
strip_prefix (bool): Strips off the prefix of the parameter, e.g. AttentiveFP_mode becomes mode
- Returns:
dict: Dictionary containing a subset of parameters that are expected by this function.
- get_list_args()[source]¶
Returns a list of arguments that accept a List
- Args:
None
- Returns:
List[str]: A list of prefixed argument names that will accept a List
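The prefixing and default=None behaviour described for AutoArgumentAdder can be sketched with inspect and argparse. Everything below (example_model, add_prefixed_arguments, the standalone extract_params) is hypothetical and only illustrates the idea. It does not search parent classes or handle synonyms as the real class is described to do.

```python
import argparse
import inspect


def example_model(mode="regression", n_layers=2, dropout=0.1):
    """Hypothetical stand-in for a model constructor."""


def add_prefixed_arguments(parser, func, prefix):
    """Add one '--<prefix>_<name>' option per keyword argument of func."""
    names = []
    for name, param in inspect.signature(func).parameters.items():
        if param.default is inspect.Parameter.empty:
            continue  # arguments without defaults are skipped in this sketch
        arg_type = type(param.default) if param.default is not None else str
        # default=None lets us tell "not supplied" apart from a real value.
        parser.add_argument(f"--{prefix}_{name}", type=arg_type, default=None)
        names.append(f"{prefix}_{name}")
    return names


def extract_params(namespace, names, prefix, strip_prefix=False):
    """Return only the non-None prefixed parameters from the namespace."""
    out = {}
    for full_name in names:
        value = getattr(namespace, full_name)
        if value is None:
            continue  # leave it out so the callee's own default applies
        key = full_name[len(prefix) + 1:] if strip_prefix else full_name
        out[key] = value
    return out


parser = argparse.ArgumentParser()
names = add_prefixed_arguments(parser, example_model, "ExampleModel")
args = parser.parse_args(["--ExampleModel_n_layers", "4"])
params = extract_params(args, names, "ExampleModel", strip_prefix=True)
```

Because unset options stay None and are dropped, only n_layers is forwarded here and the other keyword arguments keep the defaults of the target function.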
- pipeline.parameter_parser.all_auto_arguments()[source]¶
Returns a set of all arguments that get automatically added
- Args:
None
- Returns:
set: A set of all arguments that were automatically added.
- pipeline.parameter_parser.all_auto_float_lists()[source]¶
Returns a set of all arguments that are automatically added and accept a list of float.
- Args:
None
- Returns:
A set of automatically added arguments that accept a list of floats
- pipeline.parameter_parser.all_auto_int_lists()[source]¶
Returns a set of all arguments that are automatically added and accept a list of ints.
- Args:
None
- Returns:
set: A set of automatically added arguments that could accept a list of ints.
- pipeline.parameter_parser.all_auto_lists()[source]¶
Returns a set of all arguments that get automatically added and are lists
- Args:
None
- Returns:
set: A set of automatically added arguments that accept a list.
- pipeline.parameter_parser.dict_to_list(inp_dictionary, replace_spaces=False)[source]¶
Method to convert a dictionary to a modified list of strings for input to argparse. Adds a ‘--’ in front of each key in the dictionary.
- Args:
inp_dictionary (dict): Flat dictionary of parameters
replace_spaces (bool): A flag for replacing spaces with replace_spaces_str, to handle spaces on the command line.
- Returns:
(list): a list of default parameters + user specified parameters
None if inp_dictionary is None
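A minimal sketch of the dictionary-to-argument-list conversion follows. The function name dict_to_arg_list is an assumption for this example, and the sketch leaves out both the merging with default parameters and the replace_spaces handling mentioned above.

```python
def dict_to_arg_list(inp_dictionary):
    """Turn {'a': 1} into ['--a', '1']; return None for a None input."""
    if inp_dictionary is None:
        return None
    arg_list = []
    for key, value in inp_dictionary.items():
        arg_list.append(f"--{key}")   # argparse-style option name
        arg_list.append(str(value))   # argparse expects string tokens
    return arg_list


arg_list = dict_to_arg_list({"model_type": "NN", "max_epochs": 30})
```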
- pipeline.parameter_parser.extract_featurizer_params(params, strip_prefix=True)[source]¶
Extracts parameters meant for a specific featurizer. Use only for arguments automatically added by an AutoArgumentAdder
- Args:
params (Namespace): Parameter Namespace
strip_prefix (bool): Automatically added parameters come with a prefix. When True, the prefix is removed, e.g. MolGraphConvFeaturizer_use_edges becomes use_edges
- Returns:
dict: A subset of parameters from params that should be passed on to the featurizer
- pipeline.parameter_parser.extract_model_params(params, strip_prefix=True)[source]¶
Extracts parameters meant for a specific model. Use only for arguments automatically added by an AutoArgumentAdder
- Args:
params (Namespace): Parameter Namespace
strip_prefix (bool): Automatically added parameters come with a prefix. When True, the prefix is removed, e.g. AttentiveFP_mode becomes mode
- Returns:
dict: A subset of parameters from params that should be passed on to the model
- pipeline.parameter_parser.flatten_dict(inp_dict, newdict={})[source]¶
Method to flatten a hierarchical dictionary. Used in parse_config_file(). Throws error if there are duplicated keys in the dictionary. WARNING: immediately throws error upon first detection of duplications.
- Args:
inp_dict(dict): hierarchical dictionary
newdict(empty dict): empty dictionary, name of output flattened dictionary
- Returns:
newdict(dict): Flattened dictionary.
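The flattening with duplicate detection can be sketched as below. The name flatten and the choice of ValueError are assumptions; the documented function may raise a different exception type. The sketch uses None rather than a dict as the default for the accumulator, because default objects are shared between calls in Python.

```python
def flatten(inp_dict, flat=None):
    """Flatten nested dicts into one level, failing on the first duplicate key."""
    if flat is None:
        flat = {}
    for key, value in inp_dict.items():
        if isinstance(value, dict):
            flatten(value, flat)  # recurse into the sub-dictionary
        elif key in flat:
            raise ValueError(f"duplicate key: {key}")
        else:
            flat[key] = value
    return flat


config = {"model": {"layer_sizes": "64,32"}, "data": {"splitter": "scaffold"}}
flat_config = flatten(config)
```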
- pipeline.parameter_parser.get_parser()[source]¶
Method that performs the actual parsing of pre-processed parameters. Modify this method to add/change/remove parameters
- Args:
None
- Returns:
parser (argparse.Namespace): an object containing default parameters + user specific parameters
- pipeline.parameter_parser.is_list(p, type_annotation)[source]¶
Given a parameter name and annotation, returns True if it accepts a list.
Returns False on a generic list; returns True only for ‘typing.List’ or <class ‘list’>.
Performs a recursive search in the case of typing.Union.
- Args:
p (str): A parameter name.
type_annotation (object): This is a type annotation returned by the inspect module
- Returns:
boolean: If this annotation will accept a List
- pipeline.parameter_parser.is_list_float(p, type_annotation)[source]¶
Given a parameter name and annotation, returns True if it accepts a float list.
Returns False on a generic list; returns True only for ‘typing.List[float]’.
Performs a recursive search in the case of typing.Union.
- Args:
p (str): A parameter name.
type_annotation (object): This is a type annotation returned by the inspect module
- Returns:
boolean: If this annotation will accept a List[float]
- pipeline.parameter_parser.is_list_int(p, type_annotation)[source]¶
Given a parameter name and annotation, returns True if it accepts an integer list.
Returns False on a generic list; returns True only for ‘typing.List[int]’.
Performs a recursive search in the case of typing.Union.
- Args:
p (str): A parameter name.
type_annotation (object): This is a type annotation returned by the inspect module
- Returns:
boolean: If this annotation will accept a List[int]
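The recursive check on type annotations described for is_list_float and is_list_int can be sketched with typing.get_origin and typing.get_args. The helper accepts_list_of is hypothetical and takes the element type as a parameter instead of having one function per type; it is not the library's code.

```python
from typing import List, Optional, Union, get_args, get_origin


def accepts_list_of(annotation, element_type):
    """True if annotation is List[element_type] or a Union containing it."""
    origin = get_origin(annotation)
    if origin is Union:
        # Optional[X] is Union[X, None], so it is covered by this branch.
        return any(accepts_list_of(arg, element_type)
                   for arg in get_args(annotation))
    if origin is list:
        # A bare typing.List has no type arguments and is rejected here.
        return get_args(annotation) == (element_type,)
    return False


int_list_ok = accepts_list_of(List[int], int)
optional_float_list_ok = accepts_list_of(Optional[List[float]], float)
generic_list_ok = accepts_list_of(List, int)
```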
- pipeline.parameter_parser.is_primative_type(t)[source]¶
Returns true if t is of type int, str, or float
- Args:
t (type): A type
- Returns:
bool. True if type is int, str, or float
- pipeline.parameter_parser.list_defaults(hyperparam=False)[source]¶
Creates temporary required variables, to generate a Namespace.argparse object of defaults.
- Returns:
argparse.Namespace: a Namespace.argparse object containing default parameters + user specified parameters
- pipeline.parameter_parser.make_dataset_key_absolute(parsed_args)[source]¶
Converts dataset_key to an absolute path
- Args:
parsed_args (argparse.Namespace): Raw parsed arguments.
- pipeline.parameter_parser.parse_command_line(args=None)[source]¶
Parses a command line argument or a specifically formatted list of strings into a Namespace.argparse object.
- String input is in the following format:
args = ['--arg1','val1','--arg2','val2','--arg3','val3']
- Args:
args (None or list): If args is None, parse_command_line parses sys.argv. If args is a list, the list is parsed.
- Returns:
parsed_args (argparse.Namespace): an object containing default parameters + user specific parameters
- pipeline.parameter_parser.parse_config_file(config_file_path)[source]¶
Method to convert a .json configuration file to a Namespace object. Does the following conversions: .json -> hierarchical dict -> flat dict -> dict_to_list. WARNING: if there are two identical parameters on the same hierarchical level in the config.json, the .json will inherently silence the parameter higher up on the list without flagging a duplication. However, duplicate parameters in two different hierarchies or subdictionaries will be flagged by this parser.
- Args:
config_file_path(str): PATH to configuration .json file
- Returns:
argparse.Namespace: a Namespace.argparse object containing default parameters + user specified parameters
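The warning about identical keys on the same level reflects how Python's json module behaves: when an object repeats a key, the last value wins and no error is raised. The sketch below shows this, together with one possible way to detect such duplicates using object_pairs_hook. The reject_duplicates helper is an assumption for illustration and is not part of this module.

```python
import json

text = '{"splitter": "random", "splitter": "scaffold"}'

# The standard loader keeps only the last value for a repeated key.
silently_merged = json.loads(text)


def reject_duplicates(pairs):
    """object_pairs_hook that raises if a JSON object repeats a key."""
    keys = [key for key, _ in pairs]
    duplicates = sorted({key for key in keys if keys.count(key) > 1})
    if duplicates:
        raise ValueError(f"duplicate keys: {duplicates}")
    return dict(pairs)


try:
    json.loads(text, object_pairs_hook=reject_duplicates)
    duplicate_detected = False
except ValueError:
    duplicate_detected = True
```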
- pipeline.parameter_parser.parse_namespace(namespace_params=None)[source]¶
Method to convert namespace object to dictionary, then pass the value to dict_to_list. Will simply pass a dictionary
- Args:
namespace_params(dictionary or namespace.argparse object)
- Returns:
argparse.Namespace: a Namespace.argparse object containing default parameters + user specified parameters
- pipeline.parameter_parser.postprocess_args(parsed_args)[source]¶
Postprocessing for the parsed arguments. Replaces any string in null_options with a NoneType
Replaces any string that matches replace_with_space with whitespace.
Parses arguments in convert_to_float_list into a list of floats, if the hyperparams option is True. E.g. parsed_args.dropouts = “0.001,0.001 0.002,0.002 0.03,003”
-> parsed_args.dropouts = [[0.001,0.001], [0.002,0.002], [0.03,003]]
Parses arguments in convert_to_int_list into a list of ints, if the hyperparams option is True. E.g. parsed_args.layer_sizes = “10,100 20,200 30,300”
-> parsed_args.layer_sizes = [[10,100], [20,200], [30,300]]
Parameters in keep_as_list are kept as lists, even if there is a single item in the list.
Parameters in convert_to_str_list are converted to a list of strings. E.g. parsed_args.model_type = “NN,RF”
-> parsed_args.model_type = [‘NN’,’RF’].
If there is a single item in the list (no commas), the response is kept as a StringType, unless it is in response_cols, which is passed as a list.
Setting conditional options for descriptor_key.
Sets uncertainty to False when using XGBoost, because XGBoost does not support uncertainty.
- Args:
parsed_args (argparse.Namespace): Raw parsed arguments.
- Returns:
parsed_args (argparse.Namespace): an argparse.Namespace object containing properly processed arguments.
- Raises:
Exception: layer_sizes, dropouts, weight_init_stddevs and bias_init_consts arguments must be the same length
Exception: parameters within not_a_list_outside_of_hyperparams are not accepted as a list if hyperparams is False
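The string-to-list conversions described above (space-separated groups of comma-separated values) can be sketched as follows. The helper parse_nested_list is hypothetical and shows only the string handling, not the checks on hyperparams or on matching list lengths.

```python
def parse_nested_list(text, cast):
    """'10,100 20,200' -> [[10, 100], [20, 200]], applying cast to each item."""
    return [[cast(item) for item in group.split(",")]
            for group in text.split()]


layer_sizes = parse_nested_list("10,100 20,200 30,300", int)
dropouts = parse_nested_list("0.001,0.001 0.002,0.002", float)

# A comma-separated string of names becomes a flat list of strings.
model_types = "NN,RF".split(",")
```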
- pipeline.parameter_parser.primative_type_only(type_annotation)[source]¶
Given an annotation, returns only primitive types that can be read in from the command line: int, float, and str.
Default return value is str, which is default for type parameter in add_arguments
- Args:
type_annotation (type): A type annotation.
- Returns:
type: One of 3 choices, int, float, str
- pipeline.parameter_parser.prune_defaults(params, keep_params={})[source]¶
Removes parameters that are not in keep_params or in get_defaults
- Args:
params (argparse.Namespace): Raw parsed arguments.
keep_params (list): List of parameters to keep
- Returns:
new_dict (dict): Pruned argument dictionary
- pipeline.parameter_parser.remove_unrecognized_arguments(params, hyperparam=False)[source]¶
Removes arguments not recognized by argument parser
Can be used to clean inputs to wrapper function or model_pipeline. Used heavily in hyperparam_search_wrapper
- Args:
params (Namespace or dict): params to filter
- Returns:
dict of parameters
- pipeline.parameter_parser.strip_optional(type_annotation)[source]¶
In the upgrade to Python 3.9, type annotations now use typing.Optional, and we need to strip that off.
- Args:
type_annotation (object): This is a type annotation returned by the inspect module
- Returns:
list(type_annotation) or the __args__ of typing.Optional or typing.Union
- pipeline.parameter_parser.to_str(params_obj)[source]¶
Converts a namespace.argparse object or a dict into a string for command line input
- Args:
params_obj (argparse.Namespace or dict): an argparse namespace object or dict to be converted into a command line input.
E.g. params_obj = argparse.Namespace(arg1 = val1, arg2 = val2, arg3 = val3) OR params_obj = {‘arg1’:val1, ‘arg2’:val2, ‘arg3’:val3}
- Returns:
str_params (str): parameters in string format
E.g. str_params = ‘--arg1 val1 --arg2 val2 --arg3 val3’
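A minimal sketch of the conversion from a namespace or dict to a command-line string is shown here. The name params_to_str is an assumption, and the sketch does no quoting or escaping of values that contain spaces.

```python
import argparse


def params_to_str(params_obj):
    """Render a Namespace or dict as '--key value' pairs joined by spaces."""
    if isinstance(params_obj, argparse.Namespace):
        params_obj = vars(params_obj)  # Namespace attributes as a dict
    return " ".join(f"--{key} {value}" for key, value in params_obj.items())


from_namespace = params_to_str(argparse.Namespace(arg1="val1", arg2="val2"))
from_dict = params_to_str({"arg1": "val1", "arg2": "val2"})
```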
- pipeline.parameter_parser.wrapper(*any_arg)[source]¶
Wrapper to handle the ParseParams class. Calls the correct method depending on the input argument type
- Args:
*any_arg: any single input of a str, dict, argparse.Namespace, or list
- Returns:
argparse.Namespace: a Namespace.argparse object containing default parameters + user specified parameters
- Raises:
TypeError: Input argument must be a configuration file (str), dict, argparse.Namespace, or list
pipeline.perf_data module¶
Contains class PerfData and its subclasses, which are objects for collecting and computing model performance metrics and predictions
- class pipeline.perf_data.ClassificationPerfData(model_dataset, subset)[source]¶
Bases:
PerfData
Class with methods for accumulating classification model prediction data over multiple cross-validation folds and computing performance metrics after all folds have been run. Abstract class with concrete subclasses for different split strategies.
- Attributes:
- set in __init__
num_tasks (int): Set to None, the number of tasks
num_cmpds (int): Set to None, the number of compounds
num_classes (int): Set to None, the number of classes
- accumulate_preds(predicted_vals, ids, pred_stds=None)[source]¶
Raises: NotImplementedError: The method is implemented by subclasses
- get_prediction_results()[source]¶
Returns a dictionary of performance metrics for a classification model. The dictionary values will contain only primitive Python types, so that it can be easily JSONified.
- Args:
per_task (bool): True if calculating per-task metrics, False otherwise.
- Returns:
pred_results (dict): dictionary of performance metrics for a classification model.
- model_choice_score(score_type='roc_auc')[source]¶
Computes a score function based on the accumulated predicted values, to be used for selecting the best training epoch and other hyperparameters.
- Args:
score_type (str): The name of the scoring metric to be used, e.g. ‘roc_auc’, ‘precision’, ‘recall’, ‘f1’; see https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter and sklearn.metrics.SCORERS.keys() for a complete list of options. Larger values of the score function indicate better models.
- Returns:
score (float): A score function value. For multitask models, this will be averaged over tasks.
- class pipeline.perf_data.EpochManager(wrapper, subsets={'test': 'test', 'train': 'train', 'valid': 'valid'}, production=False, **kwargs)[source]¶
Bases:
object
Manages lists of PerfDatas
This class manages lists of PerfDatas as well as variables related to iteratively training a model over several epochs. This class sets several variables in a given ModelWrapper for the sake of backwards compatibility.
- Attributes:
- Set in __init__:
_subsets (dict): Must contain the keys ‘train’, ‘valid’, ‘test’. The values are used as subsets when calling create_perf_data.
_model_choice_score_type (str): Passed into PerfData.model_choice_score
_log (logger): The logger taken from wrapper.log
_should_stop (bool): True when training has satisfied the stopping conditions: either it has reached the max number of epochs or it has exceeded early_stopping_patience
wrapper (ModelWrapper): The model wrapper where this object is being used.
_new_best_valid_score (function): This function takes no arguments and is called whenever a new best validation score is achieved.
- accumulate(ei, subset, dset)[source]¶
Accumulate predictions
Makes predictions, accumulates them and calculates the performance metric. Calls PerfData.accumulate_preds belonging to the epoch, subset, and given dataset.
- Args:
ei (int): Epoch index
subset (str): Which subset, should be train, valid, or test.
dset (dc.data.Dataset): Calculates the performance for the given dset
- Returns:
float: Performance metric for the given dset.
- compute(ei, subset)[source]¶
Computes performance metrics
This calls PerfData.compute_perf_metrics and saves the result in f’{subset}_epoch_perfs’
- Args:
ei (int): Epoch index
subset (str): Which subset to compute_perf_metrics. Should be train, valid, or test
- Returns:
None
- on_new_best_valid(functional)[source]¶
Sets the function called when a new best validation score is achieved
Saves the function called when there’s a new best validation score.
- Args:
functional (function): This function takes no arguments and returns nothing. This function is called when there’s a new best validation score. This can be used to tell the ModelWrapper to save the model.
- Returns:
None
- Side effects:
Saves the _new_best_valid_score function.
- set_make_pred(functional)[source]¶
Sets the function used to make predictions
Sets the function used to make predictions. This must be called before invoking self.update and self.accumulate
- Args:
functional (function): This function takes one argument, a dc.data.Dataset, and returns an array of predictions for that dset. This function is called when updating the training state after a given epoch.
- Returns:
None
- Side effects:
Saves the functional as self._make_pred
- should_stop()[source]¶
Returns True when the training loop should stop
- Returns:
bool: True when the training loop should stop
- update(ei, subset, dset=None)[source]¶
Update training state
Updates the training state for a given subset and epoch index with the given dataset.
- Args:
ei (int): Epoch index.
subset (str): Should be train, valid, test
dset (dc.data.Dataset): Updates using this dset
- Returns:
perf (float): the performance of the given dset.
- update_epoch(ei, train_dset=None, valid_dset=None, test_dset=None)[source]¶
Update training state after an epoch
This function updates train/valid/test_perf_data. Call this function once per epoch. Call self.should_stop() after calling this function to see if you should exit the training loop.
Subsets with None arguments will be ignored
- Args:
ei (int): The epoch index
train_dset (dc.data.Dataset): The train dataset
valid_dset (dc.data.Dataset): The valid dataset. Providing this argument updates best_valid_score and _should_stop
test_dset (dc.data.Dataset): The test dataset
- Returns:
list: A list of performance values for the provided datasets.
- Side effects:
This function updates self._should_stop
- update_valid(ei)[source]¶
Checks validation score
Checks validation performance of the given epoch index. Updates self._should_stop, checks on early stopping conditions, calls self._new_best_valid_score() when necessary.
- Args:
ei (int): Epoch index
- Returns:
None
- Side effects:
Updates self._should_stop when it’s time to exit the training loop.
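The interplay of update_valid, on_new_best_valid and should_stop can be sketched with a much-simplified stand-in. MiniEpochManager, its constructor arguments and the exact patience rule are assumptions made for this example. Unlike the real class, the sketch takes the validation score directly rather than computing it from a dataset.

```python
class MiniEpochManager:
    """Hypothetical, simplified stand-in for the behaviour described above."""

    def __init__(self, max_epochs, patience):
        self.max_epochs = max_epochs
        self.patience = patience
        self.best_valid_score = None
        self.best_epoch = 0
        self._should_stop = False
        self._new_best_valid_score = lambda: None  # no-op until registered

    def on_new_best_valid(self, functional):
        self._new_best_valid_score = functional

    def update_valid(self, ei, valid_score):
        if self.best_valid_score is None or valid_score > self.best_valid_score:
            self.best_valid_score = valid_score
            self.best_epoch = ei
            self._new_best_valid_score()  # e.g. save a model checkpoint
        elif ei - self.best_epoch > self.patience:
            self._should_stop = True  # patience exceeded
        if ei >= self.max_epochs - 1:
            self._should_stop = True  # reached the epoch limit

    def should_stop(self):
        return self._should_stop


saved_epochs = []
valid_scores = [0.50, 0.62, 0.61, 0.60, 0.59, 0.58]  # made-up scores
manager = MiniEpochManager(max_epochs=20, patience=2)
manager.on_new_best_valid(lambda: saved_epochs.append(manager.best_epoch))

last_epoch = None
for ei, score in enumerate(valid_scores):
    last_epoch = ei
    manager.update_valid(ei, score)
    if manager.should_stop():
        break
```

In this run the callback fires at epochs 0 and 1, and the loop exits at epoch 4, once more than two epochs have passed without a new best score.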
- class pipeline.perf_data.EpochManagerKFold(wrapper, subsets={'test': 'test', 'train': 'train', 'valid': 'valid'}, production=False, **kwargs)[source]¶
Bases:
EpochManager
This class manages the training state when using KFold cross validation. This is necessary because this manager uses f’{subset}_epoch_perf_stds’ unlike EpochManager
- class pipeline.perf_data.HybridPerfData(model_dataset, subset)[source]¶
Bases:
PerfData
Class with methods for accumulating regression model prediction data over multiple cross-validation folds and computing performance metrics after all folds have been run. Abstract class with concrete subclasses for different split strategies.
- Attributes:
- set in __init__
num_tasks (int): Set to None, the number of tasks
num_cmpds (int): Set to None, the number of compounds
- accumulate_preds(predicted_vals, ids, pred_stds=None)[source]¶
Raises: NotImplementedError: The method is implemented by subclasses
- compute_perf_metrics(per_task=False)[source]¶
Raises: NotImplementedError: The method is implemented by subclasses
- get_prediction_results()[source]¶
Returns a dictionary of performance metrics for a regression model. The dictionary values should contain only primitive Python types, so that it can be easily JSONified.
- Args:
per_task (bool): True if calculating per-task metrics, False otherwise.
- Returns:
pred_results (dict): dictionary of performance metrics for a regression model.
- model_choice_score(score_type='r2')[source]¶
Computes a score function based on the accumulated predicted values, to be used for selecting the best training epoch and other hyperparameters.
- Args:
score_type (str): The name of the scoring metric to be used, e.g. ‘r2’, ‘mae’, ‘rmse’
- Returns:
- score (float): A score function value. For multitask models, this will be averaged
over tasks.
- class pipeline.perf_data.KFoldClassificationPerfData(model_dataset, transformers, subset, predict_probs=True, transformed=True)[source]¶
Bases:
ClassificationPerfData
Class with methods for accumulating classification model performance data over multiple cross-validation folds and computing performance metrics after all folds have been run.
- Attributes:
- Set in __init__:
subset (str): Label of the type of subset of dataset for tracking predictions
num_cmps (int): The number of compounds in the dataset
num_tasks (int): The number of tasks in the dataset
pred_vals (dict): The dictionary of prediction results
folds (int): Initialized at zero, flag for determining which k-fold is being assessed
transformers (list of Transformer objects): from input arguments
real_vals (dict): The dictionary containing the origin response column values
class_names (np.array): Assumes the classes are of deepchem index type (e.g. 0,1,2,…)
num_classes (int): The number of classes to predict on
- accumulate_preds(predicted_vals, ids, pred_stds=None)[source]¶
Add training, validation or test set predictions from the current fold to the data structure where we keep track of them.
- Args:
predicted_vals (np.array): Array of the predicted values for the current dataset
ids (np.array): An np.array of compound ids for the current dataset
pred_stds (np.array): An array of the standard deviation in the predictions, not used in this method
- Returns:
None
- Side effects:
Overwrites the attribute pred_vals
Increments folds by 1
- compute_perf_metrics(per_task=False)[source]¶
Computes the ROC AUC metrics for each task based on the accumulated values, averaged over training folds, along with standard deviations of the scores. If per_task is False, the scores are averaged over tasks and the overall standard deviation is reported instead.
- Args:
per_task (bool): True if calculating per-task metrics, False otherwise.
- Returns:
A tuple (roc_auc_mean, roc_auc_std):
roc_auc_mean: A numpy array of mean ROC AUC scores for each task, averaged over folds, if per_task is True. Otherwise, a float giving the ROC AUC score averaged over both folds and tasks.
roc_auc_std: A numpy array of standard deviations over folds of ROC AUC values, if per_task is True. Otherwise, a float giving the overall standard deviation.
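The aggregation over folds and tasks can be illustrated with plain Python on made-up scores. The function summarize is hypothetical. The population standard deviation (pstdev) and the definition of the overall standard deviation as the spread of task-averaged fold scores are both assumptions, since the text above does not specify them.

```python
from statistics import mean, pstdev

# Hypothetical ROC AUC scores: one row per fold, one column per task.
fold_scores = [
    [0.80, 0.70],
    [0.90, 0.60],
    [0.85, 0.65],
]


def summarize(fold_scores, per_task=False):
    """Return (mean, std) per task, or averaged over tasks when per_task is False."""
    task_columns = list(zip(*fold_scores))  # regroup scores by task
    if per_task:
        return ([mean(col) for col in task_columns],
                [pstdev(col) for col in task_columns])
    # Average over tasks within each fold, then summarize across folds.
    fold_means = [mean(row) for row in fold_scores]
    return mean(fold_means), pstdev(fold_means)


task_means, task_stds = summarize(fold_scores, per_task=True)
overall_mean, overall_std = summarize(fold_scores, per_task=False)
```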
- get_pred_values()[source]¶
Returns the predicted values accumulated over training, with any transformations undone. If self.subset is ‘train’, ‘train_valid’ or ‘test’, the function will return the means and standard deviations of the class probabilities over the training folds for each compound, for each task. Otherwise, returns a single set of predicted probabilities for each validation set compound. For all subsets, returns the compound IDs and the most probable classes for each task.
- Returns:
ids (list): list of compound IDs.
pred_classes (np.array): an (ncmpds, ntasks) array of predicted classes.
class_probs (np.array): a (ncmpds, ntasks, nclasses) array of predicted probabilities for the classes.
prob_stds (np.array): a (ncmpds, ntasks, nclasses) array of standard errors over folds for the class probability estimates (only available for the ‘train’ and ‘test’ subsets; None otherwise).
- get_real_values(ids=None)[source]¶
Returns the real dataset response values as an (ncmpds, ntasks, nclasses) array of indicator bits (if nclasses > 2) or an (ncmpds, ntasks) array of binary classes (if nclasses == 2), with compound IDs in the same order as in the return from get_pred_values() (unless ids is specified).
- Args:
ids (list of str): Optional list of compound IDs to return values for.
- Returns:
np.array: an (ncmpds, ntasks, nclasses) array of indicator bits, or an (ncmpds, ntasks) array of binary classes if nclasses == 2
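The two return layouts can be illustrated with nested lists standing in for arrays. The helper to_indicator_bits is hypothetical and assumes integer class labels starting at 0; it only shows the shape convention and does not reproduce the method.

```python
def to_indicator_bits(class_labels, num_classes):
    """class_labels holds one row per compound and one integer label per task."""
    if num_classes == 2:
        # Binary case: keep the (ncmpds, ntasks) layout of 0/1 labels.
        return class_labels
    # Multiclass case: expand each label into num_classes indicator bits.
    return [[[1 if label == c else 0 for c in range(num_classes)]
             for label in row]
            for row in class_labels]


multiclass = to_indicator_bits([[0], [2]], num_classes=3)
binary = to_indicator_bits([[0], [1]], num_classes=2)
```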
- get_weights(ids=None)[source]¶
Returns the dataset response weights, as an (ncmpds, ntasks) array in the same ID order as get_pred_values() (unless ids is specified).
- Args:
ids (list of str): Optional list of compound IDs to return values for.
- Returns:
np.array (ncmpds, ntasks) of the real dataset response weights, in the same ID order as get_pred_values().
- class pipeline.perf_data.KFoldRegressionPerfData(model_dataset, transformers, subset, transformed=True)[source]¶
Bases:
RegressionPerfData
Class with methods for accumulating regression model prediction data over multiple cross-validation folds and computing performance metrics after all folds have been run.
- Attributes:
- Set in __init__:
subset (str): Label of the type of subset of dataset for tracking predictions
num_cmps (int): The number of compounds in the dataset
num_tasks (int): The number of tasks in the dataset
pred_vals (dict): The dictionary of prediction results
folds (int): Initialized at zero, flag for determining which k-fold is being assessed
transformers (list of Transformer objects): from input arguments
real_vals (dict): The dictionary containing the origin response column values
- accumulate_preds(predicted_vals, ids, pred_stds=None)[source]¶
Add training, validation or test set predictions from the current fold to the data structure where we keep track of them.
- Args:
predicted_vals (np.array): Array of the predicted values for the current dataset
ids (np.array): An np.array of compound ids for the current dataset
pred_stds (np.array): An array of the standard deviation in the predictions, not used in this method
- Returns:
None
- Raises:
ValueError: If Predicted value dimensions don’t match num_tasks for RegressionPerfData
- Side effects:
Overwrites the attribute pred_vals
Increments folds by 1
- compute_perf_metrics(per_task=False)[source]¶
Computes the R-squared metrics for each task based on the accumulated values, averaged over training folds, along with standard deviations of the scores. If per_task is False, the scores are averaged over tasks and the overall standard deviation is reported instead.
- Args:
per_task (bool): True if calculating per-task metrics, False otherwise.
- Returns:
A tuple (r2_mean, r2_std):
r2_mean: A numpy array of mean R^2 scores for each task, averaged over folds, if per_task is True. Otherwise, a float giving the R^2 score averaged over both folds and tasks.
r2_std: A numpy array of standard deviations over folds of R^2 values, if per_task is True. Otherwise, a float giving the overall standard deviation.
- get_pred_values()[source]¶
Returns the predicted values accumulated over training, with any transformations undone. If self.subset is ‘train’ or ‘test’, the function will return averages over the training folds for each compound along with standard deviations when there are predictions from multiple folds. Otherwise, returns a single predicted value for each compound.
- Returns:
ids (np.array): list of compound IDs
vals (np.array): (ncmpds, ntasks) array of mean predicted values
fold_stds (np.array): (ncmpds, ntasks) array of standard deviations over folds if applicable, and None otherwise.
- get_real_values(ids=None)[source]¶
Returns the real dataset response values, with any transformations undone, as an (ncmpds, ntasks) array in the same ID order as get_pred_values() (unless ids is specified).
- Args:
ids (list of str): Optional list of compound IDs to return values for.
- Returns:
np.array (ncmpds, ntasks) of the real dataset response values, with any transformations undone, in the same ID order as get_pred_values().
- get_weights(ids=None)[source]¶
Returns the dataset response weights, as an (ncmpds, ntasks) array in the same ID order as get_pred_values() (unless ids is specified).
- Args:
ids (list of str): Optional list of compound IDs to return values for.
- Returns:
np.array (ncmpds, ntasks) of the real dataset response weights, in the same ID order as get_pred_values().
- class pipeline.perf_data.PerfData(model_dataset, subset)[source]¶
Bases:
object
Class with methods for accumulating prediction data over multiple cross-validation folds and computing performance metrics after all folds have been run. Abstract class with concrete subclasses for classification and regression models.
- accumulate_preds(predicted_vals, ids, pred_stds=None)[source]¶
Raises: NotImplementedError: The method is implemented by subclasses
- compute_perf_metrics(per_task=False)[source]¶
Raises: NotImplementedError: The method is implemented by subclasses
- get_prediction_results()[source]¶
Raises: NotImplementedError: The method is implemented by subclasses
- class pipeline.perf_data.RegressionPerfData(model_dataset, subset)[source]¶
Bases:
PerfData
Class with methods for accumulating regression model prediction data over multiple cross-validation folds and computing performance metrics after all folds have been run. Abstract class with concrete subclasses for different split strategies.
- Attributes:
- set in __init__
num_tasks (int): Set to None, the number of tasks
num_cmpds (int): Set to None, the number of compounds
- accumulate_preds(predicted_vals, ids, pred_stds=None)[source]¶
Raises: NotImplementedError: The method is implemented by subclasses
- compute_perf_metrics(per_task=False)[source]¶
Raises: NotImplementedError: The method is implemented by subclasses
- get_prediction_results()[source]¶
Returns a dictionary of performance metrics for a regression model. The dictionary values should contain only primitive Python types, so that it can be easily JSONified.
- Args:
per_task (bool): True if calculating per-task metrics, False otherwise.
- Returns:
pred_results (dict): dictionary of performance metrics for a regression model.
- model_choice_score(score_type='r2')[source]¶
Computes a score function based on the accumulated predicted values, to be used for selecting the best training epoch and other hyperparameters.
- Args:
score_type (str): The name of the scoring metric to be used, e.g. ‘r2’, ‘neg_mean_squared_error’, ‘neg_mean_absolute_error’, etc.; see https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter and sklearn.metrics.SCORERS.keys() for a complete list of options. Larger values of the score function indicate better models.
- Returns:
score (float): A score function value. For multitask models, this will be averaged over tasks.
- class pipeline.perf_data.SimpleClassificationPerfData(model_dataset, transformers, subset, predict_probs=True, transformed=True)[source]¶
Bases:
ClassificationPerfData
Class with methods for collecting classification model prediction and performance data from single-fold training and prediction runs.
- Attributes:
- Set in __init__:
subset (str): Label of the type of subset of dataset for tracking predictions
num_cmps (int): The number of compounds in the dataset
num_tasks (int): The number of tasks in the dataset
pred_vals (dict): The dictionary of prediction results
folds (int): Initialized at zero, flag for determining which k-fold is being assessed
transformers (list of Transformer objects): from input arguments
real_vals (dict): The dictionary containing the origin response column values
class_names (np.array): Assumes the classes are of deepchem index type (e.g. 0,1,2,…)
num_classes (int): The number of classes to predict on
- accumulate_preds(predicted_vals, ids, pred_stds=None)[source]¶
Add training, validation or test set predictions from the current dataset to the data structure where we keep track of them.
- Arguments:
predicted_vals (np.array): Array of predicted values (class probabilities)
ids (list): List of the compound ids of the dataset
pred_stds (np.array): Optional np.array of the prediction standard deviations
- Side effects:
Updates self.pred_vals and self.perf_metrics
- compute_perf_metrics(per_task=False)[source]¶
Returns the ROC_AUC metrics for each task based on the accumulated predictions. If per_task is False, returns the average ROC AUC over tasks.
- Args:
per_task (bool): Whether to return individual ROC AUC scores for each task
- Returns:
A tuple (roc_auc, std):
roc_auc: A numpy array of ROC AUC scores, if per_task is True. Otherwise, a float giving the mean ROC AUC score over tasks.
std: Placeholder for an array of standard deviations. Always None for this class.
- get_pred_values()[source]¶
Returns the predicted values accumulated over training, with any transformations undone. If self.subset is ‘train’, the function will average class probabilities over the k-1 folds in which each compound was part of the training set, and return the most probable class. Otherwise, there should be a single set of predicted probabilities for each validation or test set compound. Returns a tuple (ids, pred_classes, class_probs, prob_stds), where ids is the list of compound IDs, pred_classes is an (ncmpds, ntasks) array of predicted classes, class_probs is a (ncmpds, ntasks, nclasses) array of predicted probabilities for the classes, and prob_stds is a (ncmpds, ntasks, nclasses) array of standard errors for the class probability estimates.
- Returns:
- Tuple (ids, pred_classes, class_probs, prob_stds)
ids (list): Contains the dataset compound ids
pred_classes (np.array): Contains (ncmpds, ntasks) array of predicted classes
class_probs (np.array): Contains (ncmpds, ntasks, nclasses) array of predicted class probabilities
prob_stds (np.array): Contains (ncmpds, ntasks, nclasses) array of standard errors for the class probability estimates
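The averaging step described for the ‘train’ subset can be sketched for a single compound and task as follows. This is an illustration only, assuming the per-fold probabilities are available as plain lists; the helper names are hypothetical and the standard-error calculation is not shown.

```python
def average_fold_probs(fold_probs):
    """Average class probabilities over folds for one compound and one task.

    fold_probs: list of per-fold probability lists, each of length nclasses.
    """
    nfolds = len(fold_probs)
    nclasses = len(fold_probs[0])
    return [sum(fp[c] for fp in fold_probs) / nfolds for c in range(nclasses)]


def most_probable_class(class_probs):
    """Index of the class with the highest averaged probability."""
    return max(range(len(class_probs)), key=lambda c: class_probs[c])


# With 5-fold CV, a compound is in the training set for k-1 = 4 folds.
fold_probs = [
    [0.2, 0.5, 0.3],
    [0.1, 0.6, 0.3],
    [0.3, 0.4, 0.3],
    [0.2, 0.5, 0.3],
]
mean_probs = average_fold_probs(fold_probs)
pred_class = most_probable_class(mean_probs)
```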
- get_real_values(ids=None)[source]¶
Returns the real dataset response values as an (ncmpds, ntasks, nclasses) array of indicator bits. If nclasses == 2, the returned array has dimension (ncmpds, ntasks).
- Args:
ids: Ignored for this class
- Returns:
np.array of the response values of the real dataset as indicator bits
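The indicator-bit layout described for get_real_values can be pictured with the sketch below. It uses nested lists instead of numpy arrays, the helper name to_indicator_bits is hypothetical, and it assumes that in the binary case the 0/1 class labels themselves are returned, which the documentation implies but does not state.

```python
def to_indicator_bits(class_indices, num_classes):
    """Convert an (ncmpds, ntasks) nested list of class indices to indicator bits.

    Assumption: for num_classes == 2 the 0/1 labels are returned unchanged,
    so the result keeps the (ncmpds, ntasks) shape.
    """
    if num_classes == 2:
        return [list(row) for row in class_indices]
    # One-hot encode each label, giving shape (ncmpds, ntasks, nclasses).
    return [
        [[1 if c == label else 0 for c in range(num_classes)] for label in row]
        for row in class_indices
    ]


# Two compounds, two tasks, three classes.
multi = to_indicator_bits([[0, 2], [1, 1]], num_classes=3)
# Two compounds, one task, two classes.
binary = to_indicator_bits([[0], [1]], num_classes=2)
```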
- class pipeline.perf_data.SimpleHybridPerfData(model_dataset, transformers, subset, is_ki, ki_convert_ratio=None, transformed=True)[source]¶
Bases: HybridPerfData
Class with methods for accumulating hybrid model prediction data from training, validation or test sets and computing performance metrics.
- Attributes:
- Set in __init__:
subset (str): Label of the type of subset of dataset for tracking predictions
num_cmps (int): The number of compounds in the dataset
num_tasks (int): The number of tasks in the dataset
pred_vals (dict): The dictionary of prediction results
folds (int): Initialized at zero, flag for determining which k-fold is being assessed
transformers (list of Transformer objects): from input arguments
real_vals (dict): The dictionary containing the original response column values
- accumulate_preds(predicted_vals, ids, pred_stds=None)[source]¶
Add training, validation or test set predictions to the data structure where we keep track of them.
- Args:
predicted_vals (np.array): Array of predicted values
ids (list): List of the compound ids of the dataset
pred_stds (np.array): Optional np.array of the prediction standard deviations
- Side effects:
Reshapes the predicted values and the standard deviations (if they are given)
- compute_perf_metrics(per_task=False)[source]¶
Returns the R-squared metric, either per task or averaged over tasks, based on the accumulated values.
- Args:
per_task (bool): True if calculating per-task metrics, False otherwise.
- Returns:
- A tuple (r2_score, std):
r2_score (np.array): An array of scores for each task, if per_task is True. Otherwise, it is a float containing the average R^2 score over tasks.
std: Always None for this class.
- get_pred_values()[source]¶
Returns the predicted values accumulated over training, with any transformations undone. Returns a tuple (ids, values, stds), where ids is the list of compound IDs, values is a (ncmpds, ntasks) array of predictions, and stds is always None for this class.
- Returns:
- Tuple (ids, vals, stds)
ids (list): Contains the dataset compound ids
vals (np.array): Contains (ncmpds, ntasks) array of predictions
stds (np.array): Contains (ncmpds, ntasks) array of prediction standard deviations
- get_real_values(ids=None)[source]¶
Returns the real dataset response values, with any transformations undone, as an (ncmpds, ntasks) array with compounds in the same ID order as in the return from get_pred_values().
- Args:
ids: Ignored for this class
- Returns:
np.array: Containing the real dataset response values with transformations undone.
- class pipeline.perf_data.SimpleRegressionPerfData(model_dataset, transformers, subset, transformed=True)[source]¶
Bases: RegressionPerfData
Class with methods for accumulating regression model prediction data from training, validation or test sets and computing performance metrics.
- Attributes:
- Set in __init__:
subset (str): Label of the type of subset of dataset for tracking predictions
num_cmps (int): The number of compounds in the dataset
num_tasks (int): The number of tasks in the dataset
pred_vals (dict): The dictionary of prediction results
folds (int): Initialized at zero, flag for determining which k-fold is being assessed
transformers (list of Transformer objects): from input arguments
real_vals (dict): The dictionary containing the original response column values
- accumulate_preds(predicted_vals, ids, pred_stds=None)[source]¶
Add training, validation or test set predictions to the data structure where we keep track of them.
- Args:
predicted_vals (np.array): Array of predicted values
ids (list): List of the compound ids of the dataset
pred_stds (np.array): Optional np.array of the prediction standard deviations
- Side effects:
Reshapes the predicted values and the standard deviations (if they are given)
- compute_perf_metrics(per_task=False)[source]¶
Returns the R-squared metric, either per task or averaged over tasks, based on the accumulated values.
- Args:
per_task (bool): True if calculating per-task metrics, False otherwise.
- Returns:
- A tuple (r2_score, std):
r2_score (np.array): An array of scores for each task, if per_task is True. Otherwise, it is a float containing the average R^2 score over tasks.
std: Always None for this class.
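As a rough illustration of the return value described above, the sketch below computes R-squared per task and averaged over tasks in pure Python. The helper names and data are hypothetical, and the class itself may compute the score differently (for example through a library routine).

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination for one task: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R-squared is undefined when all actual values are equal")
    return 1.0 - ss_res / ss_tot


def compute_r2(y_true_by_task, y_pred_by_task, per_task=False):
    """Return (r2_score, std); std is always None, as documented for this class."""
    scores = [r2_score(t, p) for t, p in zip(y_true_by_task, y_pred_by_task)]
    if per_task:
        return scores, None
    return sum(scores) / len(scores), None


# Task 1 is predicted perfectly; task 2 always predicts the mean of the actual values.
actual = [[1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0]]
predicted = [[1.0, 2.0, 3.0, 4.0], [2.5, 2.5, 2.5, 2.5]]

per_task_r2, std = compute_r2(actual, predicted, per_task=True)
mean_r2, _ = compute_r2(actual, predicted)
```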
- get_pred_values()[source]¶
Returns the predicted values accumulated over training, with any transformations undone. Returns a tuple (ids, values, stds), where ids is the list of compound IDs, values is a (ncmpds, ntasks) array of predictions, and stds is always None for this class.
- Returns:
- Tuple (ids, vals, stds)
ids (list): Contains the dataset compound ids
vals (np.array): Contains (ncmpds, ntasks) array of predictions
stds (np.array): Contains (ncmpds, ntasks) array of prediction standard deviations
- get_real_values(ids=None)[source]¶
Returns the real dataset response values, with any transformations undone, as an (ncmpds, ntasks) array with compounds in the same ID order as in the return from get_pred_values().
- Args:
ids: Ignored for this class
- Returns:
np.array: Containing the real dataset response values with transformations undone.
- pipeline.perf_data.create_perf_data(prediction_type, model_dataset, transformers, subset, **kwargs)[source]¶
Factory function that creates the right kind of PerfData object for the given subset, prediction_type (classification or regression) and split strategy (k-fold or train/valid/test).
- Args:
prediction_type (str): classification or regression.
model_dataset (ModelDataset): Object representing the full dataset.
transformers (list): A list of transformer objects.
subset (str): Label in [‘train’, ‘valid’, ‘test’, ‘full’], indicating the type of subset of dataset for tracking predictions
**kwargs: Additional PerfData subclass arguments
- Returns:
PerfData object
- Raises:
ValueError: if split_strategy not in [‘train_valid_test’, ‘k_fold_cv’]
ValueError: if prediction_type not in [‘regression’, ‘classification’]
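The dispatch logic of a factory like this can be sketched as below. This is a hypothetical illustration, not the function's source: it returns a class name as a string instead of constructing an object, and the ‘KFold…’ class names are an assumption about the naming of the k-fold counterparts of the ‘Simple…’ classes documented here.

```python
def choose_perf_data_class(prediction_type, split_strategy):
    """Pick a PerfData class name for the given settings (illustrative only)."""
    if split_strategy not in ("train_valid_test", "k_fold_cv"):
        raise ValueError(f"Unknown split_strategy: {split_strategy}")
    if prediction_type not in ("regression", "classification"):
        raise ValueError(f"Unknown prediction_type: {prediction_type}")
    # Assumed naming scheme: 'Simple' for train/valid/test, 'KFold' for k-fold CV.
    prefix = "Simple" if split_strategy == "train_valid_test" else "KFold"
    return f"{prefix}{prediction_type.capitalize()}PerfData"


simple_name = choose_perf_data_class("regression", "train_valid_test")
kfold_name = choose_perf_data_class("classification", "k_fold_cv")
```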
- pipeline.perf_data.negative_predictive_value(y_real, y_pred)[source]¶
Computes negative predictive value of a binary classification model: NPV = TN/(TN+FN).
- Args:
y_real (np.array): Array of ground truth values
y_pred (np.array): Array of predicted values
- Returns:
(float): The negative predictive value
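The formula NPV = TN/(TN+FN) can be worked through with a small pure-Python sketch. The function name npv_sketch is hypothetical, labels are assumed to be 0 (negative) and 1 (positive), and returning NaN when no negatives are predicted is an assumption rather than documented behavior.

```python
def npv_sketch(y_real, y_pred):
    """Negative predictive value for binary 0/1 labels: TN / (TN + FN)."""
    tn = sum(1 for r, p in zip(y_real, y_pred) if r == 0 and p == 0)
    fn = sum(1 for r, p in zip(y_real, y_pred) if r == 1 and p == 0)
    if tn + fn == 0:
        # Assumption: undefined when the model predicts no negatives at all.
        return float("nan")
    return tn / (tn + fn)


y_real = [0, 0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 1, 0, 0]
# Predicted negatives are at indices 0, 2, 4, 5: two are true negatives, two are false.
npv = npv_sketch(y_real, y_pred)
```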
pipeline.perf_plots module¶
Plotting routines for visualizing performance of regression and classification models
- pipeline.perf_plots.plot_ROC_curve(MP, epoch_label='best', pdf_dir=None)[source]¶
Plot ROC curves for a classification model.
- Args:
MP (ModelPipeline): Pipeline object for a model that was trained in the current Python session.
epoch_label (str): Label for training epoch to draw predicted values from. Currently ‘best’ is the only allowed value.
pdf_dir (str): If given, output the plots to a PDF file in the given directory.
- Returns:
None
- pipeline.perf_plots.plot_perf_vs_epoch(MP, pdf_dir=None)[source]¶
Plot the current NN model’s standard performance metric (r2_score or roc_auc_score) vs epoch number for the training, validation and test subsets. If the model was trained with k-fold CV, plot shading for the validation set out to ± 1 SD from the mean score metric values, and plot the training and test set metrics from the final model retraining rather than the cross-validation phase. Make a second plot showing the validation set model choice score used for ranking training epochs and other hyperparameters against epoch number.
- Args:
MP (ModelPipeline): Pipeline object for a model that was trained in the current Python session.
pdf_dir (str): If given, output the plots to a PDF file in the given directory.
- Returns:
None
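The shaded band for k-fold CV models can be understood as the per-epoch mean of the validation scores plus and minus one standard deviation across folds. The sketch below computes such a band from made-up scores; the helper name is hypothetical, and the use of the population standard deviation is an assumption, since the documentation does not say which estimator is used.

```python
import statistics


def validation_band(scores_by_fold):
    """Return (mean - SD, mean, mean + SD) per epoch from per-fold score lists.

    scores_by_fold: list over folds, each a list of per-epoch validation scores.
    """
    bands = []
    for epoch_scores in zip(*scores_by_fold):
        mean = statistics.mean(epoch_scores)
        sd = statistics.pstdev(epoch_scores)  # assumption: population SD
        bands.append((mean - sd, mean, mean + sd))
    return bands


# Two folds, two epochs of made-up validation scores.
bands = validation_band([[0.5, 0.6], [0.7, 0.8]])
low, mean, high = bands[0]
```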
- pipeline.perf_plots.plot_prec_recall_curve(MP, epoch_label='best', pdf_dir=None)[source]¶
Plot precision-recall curves for a classification model.
- Args:
MP (ModelPipeline): Pipeline object for a model that was trained in the current Python session.
epoch_label (str): Label for training epoch to draw predicted values from. Currently ‘best’ is the only allowed value.
pdf_dir (str): If given, output the plots to a PDF file in the given directory.
- Returns:
None
- pipeline.perf_plots.plot_pred_vs_actual(MP, epoch_label='best', threshold=None, error_bars=False, pdf_dir=None)[source]¶
Plot predicted vs actual values from a trained regression model for each split subset (train, valid, and test).
- Args:
MP (ModelPipeline): Pipeline object for a model that was trained in the current Python session.
epoch_label (str): Label for training epoch to draw predicted values from. Currently ‘best’ is the only allowed value.
threshold (float): Threshold activity value to mark on plot with dashed lines.
error_bars (bool): If true and if uncertainty estimates are included in the model predictions, draw error bars at ± 1 SD from the predicted y values.
pdf_dir (str): If given, output the plots to a PDF file in the given directory.
- Returns:
None
- pipeline.perf_plots.plot_pred_vs_actual_from_df(pred_df, actual_col='avg_pIC50_actual', pred_col='avg_pIC50_pred', label='Prediction of Test Set', ax=None)[source]¶
Plot predicted vs actual values from a trained regression model for a given dataframe.
- Args:
pred_df (Pandas.DataFrame): A dataframe containing predicted and actual values for each compound.
actual_col (str): Column with actual values.
pred_col (str): Column with predicted values.
label (str): Descriptive label for the plot.
ax (matplotlib.axes.Axes): Optional, an axes object to plot onto. If None, one is created.
- Returns:
g (matplotlib.axes.Axes): The axes object with data.
- pipeline.perf_plots.plot_pred_vs_actual_from_file(model_path)[source]¶
Plot predicted vs actual values from a trained regression model from a model tarball.
- Args:
model_path (str): Path to an AMPL model tar.gz file.
- Returns:
None
- Effects:
A matplotlib figure is displayed with subplots for each response column and train/valid/test subsets.
- pipeline.perf_plots.plot_umap_feature_projections(MP, ndim=2, num_neighbors=20, min_dist=0.1, fit_to_train=True, dist_metric='euclidean', dist_metric_kwds={}, target_weight=0, random_seed=17, pdf_dir=None)[source]¶
Projects features of a model’s input dataset using UMAP to 2D or 3D coordinates and draws a scatterplot. Shape-codes plot markers to indicate whether the associated compound was in the training, validation or test set. For classification models, also uses the marker shape to indicate whether the compound’s class was correctly predicted, and uses color to indicate whether the true class was active or inactive. For regression models, uses the marker color to indicate the discrepancy between the predicted and actual values.
- Args:
MP (ModelPipeline): Pipeline object for a model that was trained in the current Python session.
ndim (int): Number of dimensions (2 or 3) to project features into.
num_neighbors (int): Number of nearest neighbors used by UMAP for manifold approximation. Larger values give a more global view of the data, while smaller values preserve more local detail.
min_dist (float): Parameter used by UMAP to set minimum distance between projected points.
fit_to_train (bool): If true (the default), fit the UMAP projection to the training set feature vectors only. Otherwise, fit it to the entire dataset.
dist_metric (str): Name of metric to use for initial distance matrix computation. Check UMAP documentation for supported values. The metric should be appropriate for the type of features used in the model (fingerprints or descriptors); note that jaccard is equivalent to Tanimoto distance for ECFP fingerprints.
dist_metric_kwds (dict): Additional key-value pairs used to parameterize dist_metric; see the UMAP documentation. In particular, dist_metric_kwds[‘p’] specifies the power/exponent for the Minkowski metric.
target_weight (float): Weighting factor determining balance between activities and feature values in determining topology of projected points. A weight of zero prioritizes the feature vectors; weight = 1 prioritizes the activity values, so that compounds with the same activity tend to be clustered together.
random_seed (int): Seed for random number generator.
pdf_dir (str): If given, output the plot to a PDF file in the given directory.
- Returns:
None
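To make the note about the jaccard metric concrete, the sketch below computes the Jaccard distance between two binary fingerprints, which is one minus the Tanimoto similarity. The bit vectors are made up and far shorter than real ECFP fingerprints, and the helper name is hypothetical.

```python
def jaccard_distance(fp_a, fp_b):
    """Jaccard (Tanimoto) distance between two equal-length binary fingerprints."""
    both = sum(1 for a, b in zip(fp_a, fp_b) if a and b)
    either = sum(1 for a, b in zip(fp_a, fp_b) if a or b)
    if either == 0:
        # Convention chosen here: two all-zero fingerprints are treated as identical.
        return 0.0
    return 1.0 - both / either


fp1 = [1, 1, 0, 1, 0, 0]
fp2 = [1, 0, 0, 1, 1, 0]
# Bits set in both: 2 (positions 0 and 3); bits set in either: 4.
dist = jaccard_distance(fp1, fp2)
```

For a model trained on fingerprints, this is the kind of distance that dist_metric='jaccard' would ask UMAP to use in place of the default 'euclidean'.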
- pipeline.perf_plots.plot_umap_train_set_neighbors(MP, num_neighbors=20, min_dist=0.1, dist_metric='euclidean', dist_metric_kwds={}, random_seed=17, pdf_dir=None)[source]¶
Project features of whole dataset to 2 dimensions, without regard to response values. Plot training & validation set or training and test set compounds, color- and symbol-coded according to actual classification and split set. The plot does not take predicted values into account at all. Does not work with regression data.
- Args:
MP (ModelPipeline): Pipeline object for a model that was trained in the current Python session.
num_neighbors (int): Number of nearest neighbors used by UMAP for manifold approximation. Larger values give a more global view of the data, while smaller values preserve more local detail.
min_dist (float): Parameter used by UMAP to set minimum distance between projected points.
dist_metric (str): Name of metric to use for initial distance matrix computation. Check UMAP documentation for supported values. The metric should be appropriate for the type of features used in the model (fingerprints or descriptors); note that jaccard is equivalent to Tanimoto distance for ECFP fingerprints.
dist_metric_kwds (dict): Additional key-value pairs used to parameterize dist_metric; see the UMAP documentation. In particular, dist_metric_kwds[‘p’] specifies the power/exponent for the Minkowski metric.
random_seed (int): Seed for random number generator.
pdf_dir (str): If given, output the plot to a PDF file in the given directory.
pipeline.predict_from_model module¶
Functions to run predictions from a pre-trained model against user-provided data.
- pipeline.predict_from_model.predict_from_model_file(model_path, input_df, id_col='compound_id', smiles_col='rdkit_smiles', response_col=None, conc_col=None, is_featurized=False, dont_standardize=False, AD_method=None, k=5, dist_metric='euclidean', external_training_data=None, max_train_records_for_AD=1000)[source]¶
Loads a pretrained model from a model tarball file and runs predictions on compounds in an input data frame.
- Args:
model_path (str): File path of the model tarball file.
input_df (DataFrame): Input data to run predictions on; must at minimum contain SMILES strings.
id_col (str): Name of the column containing compound IDs. If none is provided, sequential IDs will be generated.
smiles_col (str): Name of the column containing SMILES strings; required.
response_col (str): Name of an optional column containing actual response values; if it is provided, the actual values will be included in the returned data frame to make it easier for you to assess performance.
conc_col (str): Name of an optional column containing the concentration for single concentration activity (% binding) prediction in hybrid models.
is_featurized (bool): True if input_df contains precomputed feature columns. If so, input_df must contain all of the feature columns defined by the featurizer that was used when the model was trained. Default is False which tells AMPL to compute the necessary descriptors.
dont_standardize (bool): By default, SMILES strings are salt-stripped and standardized using RDKit; if you have already done this, or don’t want them to be standardized, set dont_standardize to True.
AD_method (str or None): Method to use to compute applicability domain (AD) index; may be ‘z_score’, ‘local_density’ or None (the default). With the default value, AD indices will not be calculated.
k (int): Number of nearest neighbors of each training data point used to evaluate the AD index.
dist_metric (str): Metric used to compute distances between feature vectors for AD index calculation. Valid values are ‘cityblock’, ‘cosine’, ‘euclidean’, ‘jaccard’, and ‘manhattan’. If binary features such as fingerprints are used in model, ‘jaccard’ (equivalent to Tanimoto distance) may be a better choice than the other metrics which operate on continuous features.
external_training_data (str): Path to a copy of the model training dataset. Used for AD index computation in the case where the model was trained on a different computing system, or more generally when the training data is not accessible at the path saved in the model metadata.
max_train_records_for_AD (int): Maximum number of training data rows to use for AD calculation. Note that the AD calculation time scales as the square of the number of training records used. If the training dataset is larger than max_train_records_for_AD, a random sample of rows with this size is used instead for the AD calculations.
- Returns:
A data frame with compound IDs, SMILES strings, predicted response values, and (optionally) uncertainties and/or AD indices. In addition, actual response values will be included if response_col is specified. Standard prediction error estimates will be included if the model was trained with uncertainty=True. Note that the predicted and actual response columns and standard errors will be labeled according to the response_col setting in the original training data, not the response_col passed to this function. For example, if the original model response_col was ‘pIC50’, the returned data frame will contain columns ‘pIC50_actual’, ‘pIC50_pred’ and ‘pIC50_std’.
For proper AD index calculation, the new data must use the same column names as the original training data.
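A hedged usage sketch follows. The first helper shows how the returned response columns are named after the training response_col, as described above. The second wraps a call to predict_from_model_file using only parameters from the documented signature. Both helper names are hypothetical, the import path atomsci.ddm.pipeline is an assumption about where the pipeline package is installed, and the ‘pIC50’ column name is just an example.

```python
def expected_response_columns(train_response_col, has_actual=False, has_uncertainty=False):
    """Names of the response-related columns to expect in the returned data frame."""
    cols = []
    if has_actual:
        cols.append(f"{train_response_col}_actual")
    cols.append(f"{train_response_col}_pred")
    if has_uncertainty:
        cols.append(f"{train_response_col}_std")
    return cols


def run_predictions(model_path, input_df):
    """Hypothetical wrapper; only works where AMPL and a model tarball are available."""
    # Assumed import path; adjust to match your installation.
    from atomsci.ddm.pipeline import predict_from_model as pfm

    return pfm.predict_from_model_file(
        model_path,
        input_df,
        id_col="compound_id",
        smiles_col="rdkit_smiles",
        response_col="pIC50",  # example column holding actual values in input_df
        AD_method="z_score",
        k=5,
        dist_metric="euclidean",
    )


# Columns expected when actual values are supplied and the model has uncertainty=True.
cols = expected_response_columns("pIC50", has_actual=True, has_uncertainty=True)
```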
- pipeline.predict_from_model.predict_from_tracker_model(model_uuid, collection, input_df, id_col='compound_id', smiles_col='rdkit_smiles', response_col=None, conc_col=None, is_featurized=False, dont_standardize=False, AD_method=None, k=5, dist_metric='euclidean', max_train_records_for_AD=1000)[source]¶
Loads a pretrained model from the model tracker database and runs predictions on compounds in an input data frame.
- Args:
model_uuid (str): The unique identifier of the model
collection (str): Name of the collection in the model tracker DB containing the model.
input_df (DataFrame): Input data to run predictions on; must at minimum contain SMILES strings.
id_col (str): Name of the column containing compound IDs. If none is provided, sequential IDs will be generated.
smiles_col (str): Name of the column containing SMILES strings; required.
response_col (str): Name of an optional column containing actual response values; if it is provided, the actual values will be included in the returned data frame to make it easier for you to assess performance.
conc_col (str): Name of an optional column containing the concentration for single concentration activity (% binding) prediction in hybrid models.
is_featurized (bool): True if input_df contains precomputed feature columns. If so, input_df must contain all of the feature columns defined by the featurizer that was used when the model was trained. Default is False which tells AMPL to compute the necessary descriptors.
dont_standardize (bool): By default, SMILES strings are salt-stripped and standardized using RDKit; if you have already done this, or don’t want them to be standardized, set dont_standardize to True.
AD_method (str or None): Method to use to compute applicability domain (AD) index; may be ‘z_score’, ‘local_density’ or None (the default). With the default value, AD indices will not be calculated.
k (int): Number of nearest neighbors of each training data point used to evaluate the AD index.
dist_metric (str): Metric used to compute distances between feature vectors for AD index calculation. Valid values are ‘cityblock’, ‘cosine’, ‘euclidean’, ‘jaccard’, and ‘manhattan’. If binary features such as fingerprints are used in model, ‘jaccard’ (equivalent to Tanimoto distance) may be a better choice than the other metrics which operate on continuous features.
max_train_records_for_AD (int): Maximum number of training data rows to use for AD calculation. Note that the AD calculation time scales as the square of the number of training records used. If the training dataset is larger than max_train_records_for_AD, a random sample of rows with this size is used instead for the AD calculations.
- Returns:
A data frame with compound IDs, SMILES strings, predicted response values, and (optionally) uncertainties and/or AD indices. In addition, actual response values will be included if response_col is specified. Standard prediction error estimates will be included if the model was trained with uncertainty=True. Note that the predicted and actual response columns and standard errors will be labeled according to the response_col setting in the original training data, not the response_col passed to this function. For example, if the original model response_col was ‘pIC50’, the returned data frame will contain columns ‘pIC50_actual’, ‘pIC50_pred’ and ‘pIC50_std’.
For proper AD index calculation, the new data must use the same column names as the original training data.
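The effect of max_train_records_for_AD can be pictured with the sketch below, which subsamples row indices when the training set exceeds the limit and counts the pairwise distances involved. It illustrates the documented behavior only; the helper names and the seed parameter are hypothetical, and the actual sampling code may differ.

```python
import random


def sample_training_rows(num_train_records, max_train_records_for_AD=1000, seed=0):
    """Return the row indices that would be used for the AD calculation."""
    indices = list(range(num_train_records))
    if num_train_records <= max_train_records_for_AD:
        return indices
    rng = random.Random(seed)  # seed is a hypothetical parameter for reproducibility
    return sorted(rng.sample(indices, max_train_records_for_AD))


def num_pairwise_distances(n):
    """Number of distinct pairs among n records; grows as the square of n."""
    return n * (n - 1) // 2


rows = sample_training_rows(5000)
full_cost = num_pairwise_distances(5000)
sampled_cost = num_pairwise_distances(len(rows))
```

In this example, sampling 1000 of 5000 training rows cuts the number of pairwise distances from 12,497,500 to 499,500.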