pipeline package

Submodules

pipeline.ave_splitter module

Code to split a DeepChem dataset in a way that minimizes the AVE bias, as described in the paper by Wallach & Heifets, "Most Ligand-Based Classification Benchmarks Reward Memorization Rather than Generalization" (J. Chem. Inf. Model., 2018).

Although the AVEMinSplitter class and its methods are public, you will typically not call them directly. Instead, they are invoked by setting the splitter parameter to 'ave_min' in the model parameters when you train a model.
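For example, a minimal sketch of a model parameter dictionary that selects this splitter (the dataset path and the split_valid_frac name are illustrative; consult the AMPL parameter documentation for the authoritative names):

>>> params = {
...     'dataset_key': 'my_dataset.csv',   # illustrative path
...     'splitter': 'ave_min',             # invokes AVEMinSplitter internally
...     'split_valid_frac': 0.2,
... }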

class pipeline.ave_splitter.AVEMinSplitter(metric='jaccard', verbose=True, num_workers=1, max_iter=300, ndist=100, debug_mode=False)[source]

Bases: deepchem.splits.splitters.Splitter

Class for splitting a DeepChem dataset in order to minimize the Asymmetric Validation Embedding bias.

Uses distances between feature vectors and binary classifications to compute the AVE bias for a candidate split and find a split that minimizes the bias.

Attributes:

metric (str): Name of the metric to be used to compute distances between feature vectors.

verbose (bool): Ignored.

num_workers (int): Number of threads to use to parallelize computations.

max_iter (int): Maximum number of iterations to execute while trying to minimize the bias.

ndist (int): Number of points to use to approximate CDF of distance distribution.

debug_mode (bool): If true, generate extra plots and log messages for debugging.

split(dataset, frac_train=0.8, frac_valid=0.2, frac_test=0.0, seed=None, log_every_n=None)[source]

Split dataset into training and validation sets that minimize the AVE bias. A test set is not generated; to do a 3-way split, call this function twice.

Args:

dataset (dc.Dataset): The DeepChem dataset to be split

frac_train (float): The approximate fraction of compounds to put in the training set

frac_valid (float): The approximate fraction of compounds to put in the validation or test set

frac_test (float): Ignored; included only for compatibility with the DeepChem Splitter API

seed (int): Ignored

log_every_n (int or None): Ignored

Returns:

tuple: Lists of indices of compounds assigned to the training and validation/test sets.

The third element of the tuple is an empty list, because this function only does a 2-way split.
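If you do need to call the splitter directly, a minimal sketch (assuming dataset is an already-featurized DeepChem classification dataset) looks like this:

>>> from pipeline.ave_splitter import AVEMinSplitter
>>> splitter = AVEMinSplitter(metric='jaccard')
>>> train_inds, valid_inds, test_inds = splitter.split(dataset, frac_train=0.8, frac_valid=0.2)
>>> test_inds   # always empty; this method only does a 2-way split
[]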

Todo:
Change code to do a 3-way split in one call, rather than requiring the distance matrices to be computed twice.
pipeline.ave_splitter.analyze_split(params, id_col='compound_id', smiles_col='rdkit_smiles', active_col='active')[source]

Evaluate the AVE bias for the training/validation and training/test set splits of the given dataset.

Also shows the active frequencies in each subset and for the dataset as a whole. The id_col, smiles_col and active_col arguments are defaults used when these column names aren't found in the dataset metadata; when they are found, the metadata values are used instead.

Args:

params (argparse.Namespace): Pipeline parameters.

id_col (str): Dataset column containing compound IDs.

smiles_col (str): Dataset column containing SMILES strings.

active_col (str): Dataset column containing binary classifications.

Returns:
pandas.DataFrame: Table of split subsets showing sizes, numbers and fractions of active compounds
pipeline.ave_splitter.permutation(x)

Randomly permute a sequence, or return a permuted range.

If x is a multi-dimensional array, it is only shuffled along its first index.

Note

New code should use the permutation method of a default_rng() instance instead; see the NumPy random quick start guide.

Args:
x (int or array_like): If x is an integer, randomly permute np.arange(x). If x is an array, make a copy and shuffle the elements randomly.

Returns:
out (ndarray): Permuted sequence or array range.

See also:
Generator.permutation, which should be used for new code.

>>> np.random.permutation(10)
array([1, 7, 4, 3, 0, 9, 2, 5, 8, 6]) # random
>>> np.random.permutation([1, 4, 9, 12, 15])
array([15,  1,  9,  4, 12]) # random
>>> arr = np.arange(9).reshape((3, 3))
>>> np.random.permutation(arr)
array([[6, 7, 8], # random
       [0, 1, 2],
       [3, 4, 5]])
pipeline.ave_splitter.shuffle(x)

Modify a sequence in-place by shuffling its contents.

This function only shuffles the array along the first axis of a multi-dimensional array. The order of sub-arrays is changed but their contents remains the same.

Note

New code should use the shuffle method of a default_rng() instance instead; see the NumPy random quick start guide.

Args:
x (ndarray or MutableSequence): The array, list or mutable sequence to be shuffled.

Returns:
None

See also:
Generator.shuffle, which should be used for new code.

>>> arr = np.arange(10)
>>> np.random.shuffle(arr)
>>> arr
[1 7 5 2 9 4 3 6 0 8] # random

Multi-dimensional arrays are only shuffled along the first axis:

>>> arr = np.arange(9).reshape((3, 3))
>>> np.random.shuffle(arr)
>>> arr
array([[3, 4, 5], # random
       [6, 7, 8],
       [0, 1, 2]])

pipeline.chem_diversity module

Functions to generate matrices or vectors of distances between compounds

pipeline.chem_diversity.calc_dist_diskdataset(feat_type, dist_met, dataset1, dataset2=None, calc_type='nearest', num_nearest=1, **metric_kwargs)[source]

Returns an array of distances, either between all compounds in a single dataset or between two datasets, given as DeepChem Dataset objects.

Args:

feat_type (str): How the data was featurized. Current options are ‘ECFP’ or ‘descriptors’.

dist_met (str): What distance metric to use. Current options include tanimoto, cosine, cityblock, euclidean, or any other metric supported by scipy.spatial.distance.pdist().

dataset1 (deepchem.Dataset): Dataset containing features of compounds to be compared.

dataset2 (deepchem.Dataset, optional): Second dataset, if two datasets are to be compared.

calc_type (str): Type of summarization to perform on rows of distance matrix. See function calc_summary for options.

num_nearest (int): Additional parameter for calc_types nearest, nth_nearest and avg_n_nearest.

metric_kwargs: Additional arguments to be passed to functions that calculate metrics.

Returns:
np.ndarray: Vector or matrix of distances between feature vectors.
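A minimal usage sketch (assuming dataset1 and dataset2 are DeepChem Dataset objects whose X matrices contain ECFP fingerprints):

>>> from pipeline import chem_diversity as cd
>>> nn_dists = cd.calc_dist_diskdataset('ECFP', 'tanimoto', dataset1, dataset2,
...                                     calc_type='nearest', num_nearest=1)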
pipeline.chem_diversity.calc_dist_feat_array(feat_type, dist_met, feat1, feat2=None, calc_type='nearest', num_nearest=1, **metric_kwargs)[source]

Returns a vector or array of distances, either between all compounds in a single dataset or between two datasets, given the feature matrices for the dataset(s).

Args:

feat_type (str): How the data was featurized. Current options are ‘ECFP’ or ‘descriptors’.

dist_met (str): What distance metric to use. Current options include tanimoto, cosine, cityblock, euclidean, or any other metric supported by scipy.spatial.distance.pdist().

feat1: feature matrix as a numpy array

feat2: Optional, second feature matrix

calc_type (str): Type of summarization to perform on rows of distance matrix. See function calc_summary for options.

num_nearest (int): Additional parameter for calc_types nearest, nth_nearest and avg_n_nearest.

metric_kwargs: Additional arguments to be passed to functions that calculate metrics.

Returns:
dists: vector or array of distances
pipeline.chem_diversity.calc_dist_smiles(feat_type, dist_met, smiles_arr1, smiles_arr2=None, calc_type='nearest', num_nearest=1, **metric_kwargs)[source]

Returns an array of distances between compounds given as SMILES strings, either between all pairs of compounds in a single dataset or between two datasets.

Args:

feat_type (str): How the data is to be featurized, if dist_met is not ‘mcs’. The only option supported currently is ‘ECFP’.

dist_met (str): What distance metric to use. Current options include ‘tanimoto’ and ‘mcs’.

smiles_arr1 (list): First list of SMILES strings.

smiles_arr2 (list): Optional, second list of SMILES strings. May contain a single SMILES string, for a compound-to-dataset comparison.

calc_type (str): Type of summarization to perform on rows of distance matrix. See function calc_summary for options.

num_nearest (int): Additional parameter for calc_types nearest, nth_nearest and avg_n_nearest.

metric_kwargs: Additional arguments to be passed to functions that calculate metrics.

Returns:
dists: vector or array of distances
Todo:

Fix the function _get_descriptors(), which is broken, and re-enable the ‘descriptors’ option for feat_type. Will need to add a parameter to indicate what kind of descriptors should be computed.

Allow other metrics for ECFP features, as in calc_dist_diskdataset().

pipeline.chem_diversity.calc_summary(dist_arr, calc_type, num_nearest=1, within_dset=False)[source]

Returns a summary of the distances in dist_arr, depending on calc_type.

Args:

dist_arr (np.ndarray): Either a 2D distance matrix, or a 1D condensed distance matrix (flattened upper triangle).

calc_type (str): The type of summary values to return:

all: The distance matrix itself

nearest: The distances to the num_nearest nearest neighbors of each compound (except compound itself)

nth_nearest: The distance to the num_nearest’th nearest neighbor

avg_n_nearest: The average of the num_nearest nearest neighbor distances

farthest: The distance to the farthest neighbor

avg: The average of all distances for each compound

num_nearest (int): Additional parameter for calc_types nearest, nth_nearest and avg_n_nearest.

within_dset (bool): True if input distances are between compounds in the same dataset.

Returns:

dists (np.array): A numpy array of distances. For calc_type ‘nearest’ with num_nearest > 1, this is a 2D array with a row for each compound; otherwise it is a 1D array.
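For illustration, a condensed distance matrix computed with scipy can be summarized directly; this sketch assumes feats is a 2D numpy feature matrix for a single dataset:

>>> from scipy.spatial.distance import pdist
>>> from pipeline.chem_diversity import calc_summary
>>> cond_dists = pdist(feats, metric='euclidean')
>>> nn_dists = calc_summary(cond_dists, calc_type='nearest', num_nearest=1, within_dset=True)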
pipeline.chem_diversity.upload_distmatrix_to_DS(dist_matrix, feature_type, compound_ids, bucket, title, description, tags, key_values, filepath='./', dataset_key=None)[source]

Uploads a distance matrix to the datastore with the appropriate tags

Args:

dist_matrix (np.ndarray): The distance matrix.

feature_type (str): How the data was featurized.

compound_ids (list): List of compound IDs corresponding to the rows and columns of the distance matrix (assumes the matrix is square, containing distances between all compounds in a dataset).

bucket (str): Bucket where the file will be stored.

title (str): Title of the file (in human-friendly format).

description (str): Longer text describing the file (background/usage notes).

tags (list): List of tags to assign to the datastore object.

key_values (dict): Dictionary of key:value pairs to include in the datastore object's metadata.

filepath (str): Local path where the pickled DataFrame will be stored.

dataset_key (str): If updating a file already in the datastore, enter the corresponding dataset_key.
If not, leave as None and the dataset_key will be generated automatically.
Returns:
None

pipeline.compare_models module

Functions for comparing and visualizing model performance. Most of these functions rely on ATOM’s model tracker and datastore services, which are not part of the standard AMPL installation, but a few functions will work on collections of models saved as local files.

pipeline.compare_models.copy_best_filesystem_models(result_dir, dest_dir, pred_type, force_update=False)[source]

Identify the best models for each dataset within a result directory tree (e.g. from a hyperparameter search). Copy the associated model tarballs to a destination directory.

Args:

result_dir (str): Path to model training result directory.

dest_dir (str): Path of directory where model tarballs will be copied to.

pred_type (str): Prediction type (‘classification’ or ‘regression’) of models to copy

force_update (bool): If true, overwrite tarball files that already exist in dest_dir.

Returns:
pd.DataFrame: Table of performance metrics for best models.
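A usage sketch (the directory paths are placeholders):

>>> from pipeline import compare_models as cm
>>> best_df = cm.copy_best_filesystem_models('./hyperparam_results', './best_models',
...                                          pred_type='regression', force_update=True)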
pipeline.compare_models.del_ignored_params(dictionary, ignored_params)[source]

Deletes ignored parameters from the dictionary if they exist

Args:

dictionary (dict): A dictionary with parameters

ignored_params (list of str): A list of keys to delete from the dictionary, if present

Returns:
None
pipeline.compare_models.extract_collection_perf_metrics(collection_name, output_dir, pred_type='regression')[source]

Obtain list of training datasets with models in the given collection. Get performance metrics for models on each dataset and save them as CSV files in the given output directory.

Args:

collection_name (str): Name of model tracker collection to search for models.

output_dir (str): Directory where tables of performance metrics will be written.

pred_type (str): Prediction type (‘classification’ or ‘regression’) of models to query.

Returns:
None
pipeline.compare_models.extract_model_and_feature_parameters(metadata_dict)[source]

Given a model metadata dictionary, extract the model and featurizer parameters. Looks for parameter names that end in *_specific, e.g. nn_specific, auto_featurizer_specific.

Args:
metadata_dict (dict): Dictionary containing NON-FLATTENED metadata for an AMPL model
Returns:
dict: Dictionary containing featurizer and model parameters. Most such dictionaries contain the following keys: ['max_epochs', 'best_epoch', 'learning_rate', 'layer_sizes', 'drop_outs', 'rf_estimators', 'rf_max_features', 'rf_max_depth', 'xgb_gamma', 'xgb_learning_rate', 'featurizer_parameters_dict', 'model_parameters_dict']
pipeline.compare_models.get_best_models_info(col_names=None, bucket='public', pred_type='regression', result_dir=None, PK_pipeline=False, output_dir='/usr/local/data', shortlist_key=None, input_dset_keys=None, save_results=False, subset='valid', metric_type=None, selection_type='max', other_filters={})[source]

Tabulate parameters and performance metrics for the best models, according to a given metric, trained against each specified dataset.

Args:

col_names (list of str): List of model tracker collections to search.

bucket (str): Datastore bucket for training datasets.

pred_type (str): Type of models (regression or classification).

result_dir (list of str): Result directories of the models, if model tracker is not supported.

PK_pipeline (bool): Are we being called from PK pipeline?

output_dir (str): Directory to write output table to.

shortlist_key (str): Datastore key for table of datasets to query models for.

input_dset_keys (str or list of str): List of datastore keys for datasets to query models for. Either shortlist_key or input_dset_keys must be specified, but not both.

save_results (bool): If True, write the table of results to a CSV file.

subset (str): Input dataset subset (‘train’, ‘valid’, or ‘test’) for which metrics are used to select best models.

metric_type (str): Type of performance metric (r2_score, roc_auc_score, etc.) to use to select best models.

selection_type (str): Score criterion (‘max’ or ‘min’) to use to select best models.

other_filters (dict): Additional selection criteria to include in model query.

Returns:
top_models_df (DataFrame): Table of parameters and metrics for best models for each dataset.
pipeline.compare_models.get_best_perf_table(metric_type, col_name=None, result_dir=None, model_uuid=None, metadata_dict=None, PK_pipe=False)[source]

Extract parameters and training run performance metrics for a single model. The model may be specified either by a metadata dictionary, a model_uuid or a result directory; in the model_uuid case, the function queries the model tracker DB for the model metadata. For models saved in the filesystem, can query the performance data from the original result directory, but not from a saved tarball.

Args:

metric_type (str): Performance metric to include in result dictionary.

col_name (str): Collection name containing model, if model is specified by model_uuid.

result_dir (str): result directory of the model, if Model tracker is not supported and metadata_dict not provided.

model_uuid (str): UUID of model to query, if metadata_dict is not provided.

metadata_dict (dict): Full metadata dictionary for a model, including training metrics and dataset metadata.

PK_pipe (bool): If True, include some additional parameters in the result dictionary specific to PK models.

Returns:
model_info (dict): Dictionary of parameter or metric name - value pairs.
Todo:
Add support for models saved as local tarball files.
pipeline.compare_models.get_collection_datasets(collection_name)[source]

Returns a list of unique training datasets used for all models in a given collection.

Args:
collection_name (str): Name of model tracker collection to search for models.
Returns:
list: List of model training (dataset_key, bucket) tuples.
pipeline.compare_models.get_dataset_models(collection_names, filter_dict={})[source]

Query the model tracker for all models saved in the model tracker DB under the given collection names. Returns a dictionary mapping (dataset_key,bucket) pairs to the list of (collection,model_uuid) pairs trained on the corresponding datasets.

Args:

collection_names (list): List of names of model tracker collections to search for models.

filter_dict (dict): Additional filter criteria to use in model query.

Returns:
dict: Dictionary mapping training set (dataset_key, bucket) tuples to (collection, model_uuid) pairs.
pipeline.compare_models.get_filesystem_models(result_dir, pred_type)[source]

Identify all models in result_dir and create a perf_result table with a 'tarball_path' column containing the path to each model tarball.

pipeline.compare_models.get_filesystem_perf_results(result_dir, pred_type='classification')[source]

Retrieve metadata and performance metrics for models stored in the filesystem from a hyperparameter search run.

Args:

result_dir (str): Root directory for results from a hyperparameter search training run.

pred_type (str): Prediction type (‘classification’ or ‘regression’) of models to query.

Returns:
pd.DataFrame: Table of metadata fields and performance metrics.
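A usage sketch (the result directory path is a placeholder):

>>> from pipeline import compare_models as cm
>>> perf_df = cm.get_filesystem_perf_results('./hyperparam_results', pred_type='regression')
>>> perf_df.head()   # inspect the metadata and metric columns for the trained models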
pipeline.compare_models.get_multitask_perf_from_files(result_dir, pred_type='regression')[source]

Retrieve model metadata and performance metrics stored in the filesystem from a multitask hyperparameter search. Format the per-task performance metrics in a table with a row for each task and columns for each model/subset combination.

Args:

result_dir (str): Path to root result directory containing output from a hyperparameter search run.

pred_type (str): Prediction type (‘classification’ or ‘regression’) of models to query.

Returns:
pd.DataFrame: Table of model metadata fields and performance metrics.
pipeline.compare_models.get_multitask_perf_from_files_new(result_dir, pred_type='regression')[source]

Retrieve model metadata and performance metrics stored in the filesystem from a multitask hyperparameter search. Format the per-task performance metrics in a table with a row for each task and columns for each model/subset combination.

Args:

result_dir (str): Path to root result directory containing output from a hyperparameter search run.

pred_type (str): Prediction type (‘classification’ or ‘regression’) of models to query.

Returns:
pd.DataFrame: Table of model metadata fields and performance metrics.
pipeline.compare_models.get_multitask_perf_from_tracker(collection_name, response_cols=None, expand_responses=None, expand_subsets='test', exhaustive=False)[source]

Retrieve full metadata and metrics from model tracker for all models in a collection and format them into a table, including per-task performance metrics for multitask models.

Meant for multitask NN models, but works for single task models as well.

By AKP. Works for model tracker as of 10/2020

Args:

collection_name (str): Name of model tracker collection to search for models.

response_cols (list, str or None): Names of tasks (response columns) to query performance results for.
If None, checks whether all models in the collection share the same response columns, and uses those. Otherwise, this should be a list of strings or a comma-separated string. Note: make sure the response columns are listed in the same order as in the model metadata; if unsure, run with None first to see that order.
expand_responses (list, str or None): Names of tasks / response columns you want to include results for in
the final dataframe. Useful if you have a lot of tasks and only want to look at the performance of a few of them. Must also be a list or comma separated string, and must be a subset of response_cols. If None, will expand all responses.
expand_subsets (list, str or None): Dataset subsets (‘train’, ‘valid’ and/or ‘test’) to show metrics for.
Again, must be list or comma separated string, or None to expand all.
exhaustive (bool): If True, return large dataframe with all model tracker metadata minus any columns not
in expand_responses. If False, return trimmed dataframe with most relevant columns.
Returns:
pd.DataFrame: Table of model metadata fields and performance metrics.
pipeline.compare_models.get_summary_metadata_table(uuids, collections=None)[source]

Tabulate metadata fields and performance metrics for a set of models identified by specific model_uuids.

Args:

uuids (list): List of model UUIDs to query.

collections (list or str): Names of collections in model tracker DB to get models from. If collections is
a string, it must identify one collection to search for all models. If a list, it must be of the same length as uuids. If not provided, all collections will be searched.
Returns:
pd.DataFrame: Table of metadata fields and performance metrics for models.
pipeline.compare_models.get_summary_perf_tables(collection_names=None, filter_dict={}, result_dir=None, prediction_type='regression', verbose=False)[source]

Load model parameters and performance metrics from model tracker for all models saved in the model tracker DB under the given collection names (or result directory if Model tracker is not available) with the given prediction type. Tabulate the parameters and metrics including:

dataset (assay name, target, parameter, key, bucket)

dataset size (train/valid/test/total)

number of training folds

model type (NN or RF)

featurizer

transformation type

metrics: r2_score, mae_score and rms_score for regression, or ROC AUC for classification
Args:

collection_names (list): Names of model tracker collections to search for models.

filter_dict (dict): Additional filter criteria to use in model query.

result_dir (str or list): Directories to search for models; must be provided if the model tracker DB is not available.

prediction_type (str): Type of models (classification or regression) to query.

verbose (bool): If true, print status messages as collections are processed.

Returns:
pd.DataFrame: Table of model metadata fields and performance metrics.
pipeline.compare_models.get_tarball_perf_table(model_tarball, pred_type='classification')[source]

Retrieve model metadata and performance metrics for a model saved as a tarball (.tar.gz) file.

Args:

model_tarball (str): Path of model tarball file, named as model.tar.gz.

pred_type (str): Prediction type (‘classification’ or ‘regression’) of model.

Returns:
tuple (pd.DataFrame, dict): Table of performance metrics and a dictionary of model metadata.
pipeline.compare_models.get_training_datasets(collection_names)[source]

Query the model tracker DB for all the unique dataset keys and buckets used to train models in the given collections.

Args:
collection_names (list): List of names of model tracker collections to search for models.
Returns:
dict: Dictionary mapping collection names to lists of (dataset_key, bucket) tuples for training sets.
pipeline.compare_models.get_training_perf_table(dataset_key, bucket, collection_name, pred_type='regression', other_filters={})[source]

Load performance metrics from model tracker for all models saved in the model tracker DB under a given collection that were trained against a particular dataset. Identify training parameters that vary between models, and generate plots of performance vs particular combinations of parameters.

Args:

dataset_key (str): Training dataset key.

bucket (str): Training dataset bucket.

collection_name (str): Name of model tracker collection to search for models.

pred_type (str): Prediction type (‘classification’ or ‘regression’) of models to query.

other_filters (dict): Other filter criteria to use in querying models.

Returns:
pd.DataFrame: Table of models and performance metrics.
pipeline.compare_models.get_umap_nn_model_perf_table(dataset_key, bucket, collection_name, pred_type='regression')[source]

Load performance metrics from model tracker for all NN models with the given prediction_type saved in the model tracker DB under a given collection that were trained against a particular dataset. Show parameter settings for UMAP transformer for models where they are available.

Args:

dataset_key (str): Dataset key for training dataset.

bucket (str): Dataset bucket for training dataset.

collection_name (str): Name of model tracker collection to search for models.

pred_type (str): Prediction type (‘classification’ or ‘regression’) of models to query.

Returns:
pd.DataFrame: Table of model performance metrics.
pipeline.compare_models.num_trainable_parameters_from_file(tar_path)[source]

Return the number of trainable parameters from a tar file

Given a tar file for a DeepChem model, this will return the number of trainable parameters

Args:
tar_path (str): Path to a DeepChem model tarball
Returns:
int: Number of trainable parameters.
Raises:
ValueError: If the model is not a DeepChem neural network model

pipeline.dist_metrics module

Distance metrics for compounds: Tanimoto and maximum common substructure (MCS)

pipeline.dist_metrics.mcs(mols1, mols2=None)[source]

Computes maximum common substructure (MCS) distances between pairs of molecules.

The MCS distance between molecules m1 and m2 is one minus the average of fMCS(m1,m2) and fMCS(m2,m1), where fMCS(m1,m2) is the fraction of m1’s atoms that are part of the largest common substructure of m1 and m2.

Args:

mols1 (Sequence of rdkit.Mol): First list of molecules.

mols2 (Sequence of rdkit.Mol, optional): Second list of molecules.
If not provided, computes MCS distances between pairs of molecules in mols1. Otherwise, computes a matrix of distances between pairs of molecules from mols1 and mols2.
Returns:
np.ndarray: Matrix of pairwise distances between molecules.
pipeline.dist_metrics.tanimoto(fps1, fps2=None)[source]

Compute Tanimoto distances between sets of ECFP fingerprints.

Args:

fps1 (Sequence): First list of ECFP fingerprint vectors.

fps2 (Sequence, optional): Second list of ECFP fingerprint vectors.
If not provided, computes distances between pairs of fingerprints in fps1. Otherwise, computes a matrix of distances between pairs of fingerprints in fps1 and fps2.
Returns:
np.ndarray: Matrix of pairwise distances between fingerprints.
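A minimal sketch computing a distance matrix from RDKit Morgan/ECFP fingerprints (this assumes the function accepts fingerprints in RDKit bit-vector form):

>>> from rdkit import Chem
>>> from rdkit.Chem import AllChem
>>> from pipeline.dist_metrics import tanimoto
>>> mols = [Chem.MolFromSmiles(s) for s in ['CCO', 'CCN', 'c1ccccc1']]
>>> fps = [AllChem.GetMorganFingerprintAsBitVect(m, 2, nBits=1024) for m in mols]
>>> dists = tanimoto(fps)   # 3 x 3 symmetric matrix of Tanimoto distances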
pipeline.dist_metrics.tanimoto_single(fp, fps)[source]

Compute a vector of Tanimoto distances between a single fingerprint and each fingerprint in a list.

Args:

fp : Fingerprint to be compared.

fps (Sequence): List of ECFP fingerprint vectors.

Returns:
np.ndarray: Vector of distances between fp and each fingerprint in fps.

pipeline.diversity_plots module

Plotting routines for visualizing chemical diversity of datasets

pipeline.diversity_plots.diversity_plots(dset_key, datastore=True, bucket='public', title_prefix=None, ecfp_radius=4, umap_file=None, out_dir=None, id_col='compound_id', smiles_col='rdkit_smiles', is_base_smiles=False, response_col=None, max_for_mcs=300)[source]

Plot visualizations of diversity for an arbitrary table of compounds. At minimum, the file should contain columns for a compound ID and a SMILES string. Produces a clustered heatmap display of Tanimoto distances between compounds along with a 2D UMAP projection plot based on ECFP fingerprints, with points colored according to the response variable.

Args:

dset_key (str): Datastore key or filepath for dataset.

datastore (bool): Whether to load dataset from datastore or from filesystem.

bucket (str): Name of datastore bucket containing dataset.

title_prefix (str): Prefix for plot titles.

ecfp_radius (int): Radius for ECFP fingerprint calculation.

umap_file (str, optional): Path to file to write UMAP coordinates to.

out_dir (str, optional): Output directory for plots and tables. If provided, plots will be output as PDF files rather
than in the current notebook, and some additional CSV files will be generated.

id_col (str): Column in dataset containing compound IDs.

smiles_col (str): Column in dataset containing SMILES strings.

is_base_smiles (bool): True if SMILES strings do not need to be salt-stripped and standardized.

response_col (str): Column in dataset containing response values.

max_for_mcs (int): Maximum dataset size for plots based on MCS distance. If the number of compounds is less than this
value, an additional cluster heatmap and UMAP projection plot will be produced based on maximum common substructure distance.
pipeline.diversity_plots.plot_dataset_dist_distr(dataset, feat_type, dist_metric, task_name, **metric_kwargs)[source]

Generate a density plot showing the distribution of distances between dataset feature vectors, using the specified feature type and distance metric.

Args:

dataset (deepchem.Dataset): A dataset object. At minimum, it should contain a 2D numpy array ‘X’ of feature vectors.

feat_type (str): Type of features (‘ECFP’ or ‘descriptors’).

dist_metric (str): Name of metric to be used to compute distances; can be anything supported by scipy.spatial.distance.pdist.

task_name (str): Abbreviated name to describe dataset in plot title.

metric_kwargs: Additional arguments to pass to metric.

Returns:
np.ndarray: Distance matrix.
pipeline.diversity_plots.plot_tani_dist_distr(df, smiles_col, df_name, radius=2, subset_col='subset', subsets=False, ref_subset='train', plot_width=6, ndist_max=None, **metric_kwargs)[source]

Generate a density plot showing the distribution of nearest neighbor distances between ecfp feature vectors, using the Tanimoto metric. Optionally split by subset.

Args:

df (DataFrame): A data frame containing, at minimum, a column of SMILES strings.

smiles_col (str): Name of the column containing SMILES strings.

df_name (str): Name for the dataset, to be used in the plot title.

radius (int): Radius parameter used to calculate ECFP fingerprints. The default is 2, meaning that ECFP4 fingerprints are calculated.

subset_col (str): Name of the column containing subset names.

subsets (bool): If True, distances are only calculated for compounds not in the reference subset, and the distances computed are to the nearest neighbors in the reference subset.

ref_subset (str): Reference subset for nearest-neighbor distances, if subsets is True.

plot_width (float): Plot width in inches.

ndist_max (int): Not used, included only for backward compatibility.

metric_kwargs: Additional arguments to pass to metric. Not used, included only for backward compatibility.

Returns:
dist (DataFrame): Table of individual nearest-neighbor Tanimoto distance values. If subsets is True, the table will include a column indicating the subset each compound belongs to.
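A usage sketch (assuming df is a pandas DataFrame with 'rdkit_smiles' and 'subset' columns, as produced by AMPL dataset splitting):

>>> from pipeline.diversity_plots import plot_tani_dist_distr
>>> dist_df = plot_tani_dist_distr(df, smiles_col='rdkit_smiles', df_name='My dataset',
...                                subsets=True, ref_subset='train')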

pipeline.featurization module

Classes providing different methods of featurizing compounds and other data entities

class pipeline.featurization.ComputedDescriptorFeaturization(params)[source]

Bases: pipeline.featurization.DescriptorFeaturization

Subclass for featurizers that support online computation of descriptors, usually given SMILES strings or RDKit Mol objects as input rather than compound IDs. The computed descriptors may be cached and combined with tables of precomputed descriptors to speed up access. Featurized datasets may be persisted to the filesystem or datastore.

Attributes:

Set in __init__:

feat_type (str): Type of featurizer, set in the superclass __init__

descriptor_type (str): The type of descriptor

descriptor_key (str): The path to the descriptor featurization matrix if it is saved to a file, or the key to the file in the datastore

descriptor_base (str/path): The base path to the descriptor featurization matrix

precomp_descr_table (pd.DataFrame): Initialized as an empty DataFrame; will be overwritten with the full descriptor table

compute_descriptors(smiles_df, params)[source]

Compute descriptors for the SMILES strings given in smiles_df.

Args:

smiles_df: DataFrame containing SMILES strings to compute descriptors for.

params (Namespace): Argparse Namespace argument containing the parameters

Returns:
ret_df (DataFrame): Data frame containing the compound IDs, SMILES string and descriptor columns as specified in the current parameters.
compute_moe_descriptors(smiles_df, params)[source]

Compute MOE descriptors for the SMILES strings in the given data frame

Args:

smiles_df (DataFrame): Table containing SMILES strings to compute descriptors for.

params (Namespace): Argparse Namespace argument containing the parameters.

Returns:

(tuple): Tuple containing:

desc_df (DataFrame): Data frame containing computed descriptors

is_valid (ndarray of bool): True for each input SMILES string that was valid according to RDKit

compute_mordred_descriptors(smiles_strs, params)[source]

Compute Mordred descriptors for the given list of SMILES strings

Args:

smiles_strs (iterable): SMILES strings to compute descriptors for.

params (Namespace): Argparse Namespace argument containing the parameters.

Returns:

(tuple): Tuple containing:

desc_df (DataFrame): Data frame containing computed descriptors

is_valid (ndarray of bool): True for each input SMILES string that was valid according to RDKit

compute_rdkit_descriptors(smiles_df, smiles_col='rdkit_smiles')[source]

Compute RDKit descriptors for the SMILES strings in the given data frame

Args:
smiles_df (DataFrame): Table containing SMILES strings to compute descriptors for.
smiles_col (str): Name of the column in smiles_df containing SMILES strings.
Returns:

(tuple): Tuple containing:

desc_df (DataFrame): Data frame containing computed descriptors

is_valid (ndarray of bool): True for each input SMILES string that was valid according to RDKit

featurize_data(dset_df, model_dataset)[source]

Perform featurization on the given dataset, by computing descriptors from SMILES strings or matching them to SMILES in precomputed table.

Args:

dset_df (DataFrame): A table of data to be featurized. At minimum, should include columns for the compound ID and assay value; for some featurizers, must also contain a SMILES string column.

model_dataset (ModelDataset): Object containing the dataset to be featurized

Returns:

Tuple of (features, ids, vals, attr).

features (np.array): Feature matrix.

ids (pd.DataFrame): compound IDs or SMILES strings if needed for splitting.

vals (np.array): array of response values.

attr (pd.DataFrame): dataframe containing SMILES strings indexed by compound IDs.

Raises:
Exception: if features is None, featurization failed for the dataset
Side effects:
Loads a precomputed descriptor table and sets self.precomp_descr_table to point to it, if one is specified by params.descriptor_key.
get_featurized_dset_name(dataset_name)[source]

Returns a name for the featurized dataset, for use in filenames or dataset keys. Does not include any path information. May be derived from dataset_name.

Args:
dataset_name (str): Name of the dataset
Returns:
(str): A name for the featurized dataset
scale_moe_descriptors(desc_df, descr_type)[source]

Scale selected descriptors computed by MOE by dividing their values by the atom count per molecule.

Args:

desc_df (DataFrame): Data frame containing computed descriptors.

descr_type (str): Descriptor type, used to look up expected set of descriptor columns.

Returns:
scaled_df (DataFrame): Data frame with scaled descriptors.
class pipeline.featurization.DescriptorFeaturization(params)[source]

Bases: pipeline.featurization.PersistentFeaturization

Subclass for featurizers that map sets of (usually) precomputed descriptors to compound IDs; the resulting merged dataset is persisted to the filesystem or datastore.

Attributes:

Set in __init__:

feat_type (str): Type of featurizer, set in the superclass __init__

descriptor_type (str): The type of descriptor

descriptor_key (str): The path to the descriptor featurization matrix if it is saved to a file, or the key to the file in the datastore

descriptor_base (str/path): The base path to the descriptor featurization matrix

precomp_descr_table (pd.DataFrame): Initialized as an empty DataFrame; will be overwritten with the full descriptor table

Class attributes:

supported_descriptor_types

all_desc_col

create_feature_transformer(dataset)[source]

Fit a scaling and centering transformation to the feature matrix of the given dataset, and return a DeepChem transformer object holding its parameters.

Args:
dataset (deepchem.Dataset): featurized dataset
Returns:
(list of DeepChem transformer objects): list of transformers for the feature matrix
desc_type_cols = {}
desc_type_scaled = {}
desc_type_source = {}
extract_prefeaturized_data(merged_dset_df, model_dataset)[source]

Attempts to retrieve prefeaturized data for the given dataset.

Args:

merged_dset_df (pd.DataFrame): dataset merged with the featurizers

model_dataset (ModelDataset): Object containing the dataset to be featurized # TODO: Remove model_dataset call once params.response_cols is properly set

Returns:

Tuple of (features, ids, vals, attr).

features (np.array): Feature matrix.

ids (pd.DataFrame): compound IDs or SMILES strings if needed for splitting.

vals (np.array): array of response values.

attr (pd.DataFrame): dataframe containing SMILES strings indexed by compound IDs.

featurize_data(dset_df, model_dataset)[source]

Perform featurization on the given dataset.

Args:

dset_df (DataFrame): A table of data to be featurized. At minimum, should include columns for the compound ID and assay value; for some featurizers, must also contain a SMILES string column.

model_dataset (ModelDataset): Object containing the dataset to be featurized # TODO: Remove model_dataset call once params.response_cols is properly set

Returns:

Tuple of (features, ids, vals, attr).

features (np.array): Feature matrix.

ids (pd.DataFrame): compound IDs or SMILES strings if needed for splitting.

vals (np.array): array of response values.

attr (pd.DataFrame): dataframe containing SMILES strings indexed by compound IDs.

Raises:
Exception: if features is None, featurization failed for the dataset
Side effects:
Overwrites the attribute precomp_descr_table (pd.DataFrame) with the appropriate descriptor table
get_feature_columns()[source]

Returns a list of feature column names associated with this Featurization instance.

Args:
None
Returns:
(list): List of column names of the features, pulled from DescriptorFeaturization attributes
get_feature_count()[source]

Returns the number of feature columns associated with this Featurization instance.

Args:
None
Returns:
(int): Number of feature columns associated with DescriptorFeaturization
get_feature_specific_metadata(params)[source]

Returns a dictionary of parameter settings for this Featurization object that are specific to the feature type.

Args:
params (Namespace): Argparse Namespace argument containing the parameters
get_featurized_data_subdir()[source]

Returns the name of a subdirectory (without any leading path information) in which to store featurized data, if it goes to the filesystem.

Returns:
(str): ‘scaled_descriptors’
get_featurized_dset_name(dataset_name)[source]

Returns a name for the featurized dataset, for use in filenames or dataset keys. Does not include any path information. May be derived from dataset_name.

Args:
dataset_name (str): Name of the dataset
Returns:
(str): A name for the featurized dataset
classmethod load_descriptor_spec(desc_spec_bucket, desc_spec_key)[source]

Read a descriptor specification table from the datastore or the filesystem. The table is a CSV file with the following columns:

descr_type: A string specifying a descriptor source/program and a subset of descriptor columns

source: Name of the program/package that generates the descriptors

scaled: Binary indicator for whether a subset of the descriptor values are scaled by the molecule's atom count

descriptors: A semicolon-separated list of descriptor columns.

The values in the table are used to set class variables desc_type_cols, desc_type_source and desc_type_scaled.

Args:

desc_spec_bucket : bucket where descriptor spec is located

desc_spec_key: data store key, or full file path to locate descriptor spec object

Returns:
None
Side effects:

Sets the following class variables:

cls.desc_type_cols -> map from descriptor types to their associated descriptor column names

cls.desc_type_source -> map from descriptor types to the program/package that generates them

cls.desc_type_scaled -> map from descriptor types to boolean indicators of whether some descriptor values are scaled.

cls.supported_descriptor_types -> the list of available descriptor types

load_descriptor_table(params)[source]

Load the table of precomputed feature values for the descriptor type specified in params, from the datastore_key or path specified by params.descriptor_key and params.descriptor_bucket. Will try to load the table from the local filesystem if possible, but the table should at least have a metadata record in the datastore. The local file path is the same as descriptor_key on twintron-blue, and may be taken from the LC_path metadata property if it is set.

Args:
params (Namespace): Parameters for the current pipeline instance.
Returns:
None
Side effects:
Overwrites the attribute precomp_descr_table (pd.DataFrame) with the loaded descriptor table. Sets attributes desc_id_col and desc_smiles_col, if possible, based on the datastore metadata for the descriptor table. Otherwise, sets them to reasonable defaults. Note that not all descriptor tables contain SMILES strings, but one is required if the table is to be used with ComputedDescriptorFeaturization.
supported_descriptor_types = []
class pipeline.featurization.DynamicFeaturization(params)[source]

Bases: pipeline.featurization.Featurization

Featurization subclass that supports on-the-fly featurization. Can be used when it is inexpensive to compute the features. Most DeepChem featurizers are handled through this class.

Attributes:

Set in __init__:

feat_type (str): Type of featurizer, one of ['ecfp', 'graphconv', 'molvae']

featurization_obj: The DeepChem or MoleculeVAEFeaturizer object, as determined by feat_type and params
create_feature_transformer(dataset)[source]

Fit a scaling and centering transformation to the feature matrix of the given dataset, and return a DeepChem transformer object holding its parameters.

Args:
dataset (deepchem.Dataset): featurized dataset
Returns:
Empty list since we will not be transforming the features of a DynamicFeaturization object
extract_prefeaturized_data(merged_dset_df, model_dataset)[source]

Attempts to extract prefeaturized data for the given dataset. For dynamic featurizers, we don’t save this data, so this method always returns None.

Args:

merged_dset_df (DataFrame): dataset merged with the featurizers

model_dataset (ModelDataset): Object containing the dataset to be featurized

Returns:
None, None, None, None
featurize(mols)[source]

Calls the featurize() method of the underlying DeepChem featurizer object

featurize_data(dset_df, model_dataset)[source]

Perform featurization on the given dataset.

Args:

dset_df (DataFrame): A table of data to be featurized. At minimum, should include columns for the compound ID and assay value; for some featurizers, must also contain a SMILES string column. # TODO: remove model_dataset after ensuring response_cols are set correctly.

model_dataset (ModelDataset): Contains the dataset to be featurized

Returns:

Tuple of (features, ids, vals, attr).

features (np.array): Feature matrix.

ids (pd.DataFrame): compound IDs or SMILES strings if needed for splitting.

vals (np.array): array of response values.

attr (pd.DataFrame): dataframe containing SMILES strings indexed by compound IDs.

get_feature_columns()[source]

Returns a list of feature column names associated with this Featurization instance. For DynamicFeaturization, the column names are essentially meaningless, so these will be “c0, c1, … etc.”.

Args:
None
Returns:
(list): List of column names in the format ['c0', 'c1', …], with one entry per feature
get_feature_count()[source]

Returns the number of feature columns associated with this Featurization instance.

Args:
None
Returns:
(int): The number of feature columns for the DynamicFeaturization subclass, feat_type specific
get_feature_specific_metadata(params)[source]

Returns a dictionary of parameter settings for this Featurization object that are specific to the feature type.

Args:
params (Namespace): Argparse Namespace object containing the parameter list
Returns:
dict: Dictionary containing featurizer-specific metadata as a subdict under the keys ['ecfp_specific', 'autoencoder_specific']
get_featurized_data_subdir()[source]

Returns the name of a subdirectory (without any leading path information) in which to store featurized data, if it goes to the filesystem.

Raises:
Exception: This method is not supported by the DynamicFeaturization subclass
get_featurized_dset_name(dataset_name)[source]

Returns a name for the featurized dataset, for use in filenames or dataset keys. Does not include any path information. May be derived from dataset_name.

Args:
dataset_name (str): Name of the dataset
Raises:
Exception: This method is not supported by the DynamicFeaturization subclass
class pipeline.featurization.Featurization(params)[source]

Bases: object

Abstract base class for featurization code

Attributes:
feat_type (str): Type of featurizer, set in __init__
create_feature_transformer(dataset)[source]

Fit a scaling and centering transformation to the feature matrix of the given dataset, and return a DeepChem transformer object holding its parameters.

Args:
dataset (deepchem.Dataset): featurized dataset
Returns:
Empty list
Raises:
NotImplementedError: Must be implemented by concrete subclasses
extract_prefeaturized_data(merged_dset_df, model_dataset)[source]

Extracts dataset features, values, IDs and attributes from the given prefeaturized data frame.

Args:

merged_dset_df (DataFrame): Data frame for the dataset.

model_dataset (ModelDataset): Backpointer to the ModelDataset object for the dataset to be featurized.

Raises:
NotImplementedError: Must be implemented by concrete subclasses
featurize_data(dset_df, model_dataset)[source]

Perform featurization on the given dataset.

Args:

dset_df (DataFrame): A table of data to be featurized. At minimum, should include columns for the compound ID and assay value; for some featurizers, must also contain a SMILES string column.

model_dataset (ModelDataset): Dataset to be featurized.

Raises:
NotImplementedError: Must be implemented by concrete subclasses
get_feature_columns()[source]

Returns a list of feature column names associated with this Featurization instance.

Args:
None
Raises:
NotImplementedError: Must be implemented by concrete subclasses
get_feature_count()[source]

Returns the number of feature columns associated with this Featurization instance.

Args:
None
Raises:
NotImplementedError: Must be implemented by concrete subclasses
get_feature_specific_metadata(params)[source]

Returns a dictionary of parameter settings for this Featurization object that are specific to the feature type.

Args:
params (Namespace): Contains parameters used to instantiate the featurizer.
get_featurized_data_subdir()[source]

Returns the name of a subdirectory (without any leading path information) in which to store featurized data, if it goes to the filesystem.

Raises:
NotImplementedError: Must be implemented by concrete subclasses
get_featurized_dset_name(dataset_name)[source]

Returns a name for the featurized dataset, for use in filenames or dataset keys. Does not include any path information. May be derived from dataset_name.

Args:
dataset_name (str): Name of the dataset
Raises:
NotImplementedError: Must be implemented by concrete subclasses
class pipeline.featurization.PersistentFeaturization(params)[source]

Bases: pipeline.featurization.Featurization

Subclass for featurizers that support persistent storage of featurized data. Used when computing or mapping the features is CPU- or memory-intensive, e.g. descriptors. Currently DescriptorFeaturization is the only subclass, but others are planned (e.g., UMAPDescriptorFeaturization).

create_feature_transformer(dataset)[source]

Fit a scaling and centering transformation to the feature matrix of the given dataset, and return a DeepChem transformer object holding its parameters.

Args:
dataset (deepchem.Dataset): featurized dataset
extract_prefeaturized_data(merged_dset_df, model_dataset)[source]

Attempts to extract prefeaturized data for the given dataset.

Args:

merged_dset_df (DataFrame): dataset merged with the featurizers

model_dataset (ModelDataset): Object containing the dataset to be featurized

Raises:
NotImplementedError: Currently only DescriptorFeaturization is supported; this is not a generic method
featurize_data(dset_df, model_dataset)[source]

Perform featurization on the given dataset.

Args:

dset_df (DataFrame): A table of data to be featurized. At minimum, should include columns for the compound ID and assay value; for some featurizers, must also contain a SMILES string column.

model_dataset (ModelDataset): Object containing the dataset to be featurized

Returns:
Tuple of (features, ids, vals, attr).

features (np.array): Feature matrix.

ids (pd.DataFrame): compound IDs or SMILES strings if needed for splitting.

vals (np.array): array of response values.

attr (pd.DataFrame): dataframe containing SMILES strings indexed by compound IDs.

Raises:
NotImplementedError: Currently only DescriptorFeaturization is supported; this is not a generic method
pipeline.featurization.compute_2d_mordred_descrs(mols)[source]

Compute 2D Mordred descriptors only

Args:
mols: List of RDKit mol objects for molecules to compute descriptors for.
Returns:
res_df: DataFrame containing Mordred descriptors for molecules.
pipeline.featurization.compute_all_moe_descriptors(smiles_df, params)[source]

Run MOE to compute all 317 standard descriptors.

Args:

smiles_df (DataFrame): Table containing SMILES strings and compound IDs

params (Namespace): Parsed model parameters, used to identify SMILES and compund ID columns.

Returns:
descr_df (DataFrame): Table containing the input SMILES strings and compound IDs, the “washed” SMILES string prepared by MOE, a sequence index, and columns for each MOE descriptor.
pipeline.featurization.compute_all_mordred_descrs(mols, max_cpus=None, quiet=True)[source]

Compute all Mordred descriptors, including 3D ones

Args:

mols: List of RDKit mol objects for molecules to compute descriptors for.

max_cpus: Max number of cores to use for computing descriptors. None means use all available cores.

quiet: If True, avoid displaying progress indicators for computations.

Returns:
res_df: DataFrame containing Mordred descriptors for molecules.
pipeline.featurization.compute_all_rdkit_descrs(mol_df, mol_col='mol')[source]

Compute all RDKit descriptors

Args:
mol_df (DataFrame): Table containing RDKit Mol objects in the column given by mol_col.
mol_col (str): Name of the column containing the Mol objects.
Returns:
res_df (DataFrame): Data frame containing computed descriptors.
pipeline.featurization.compute_mordred_descriptors_from_smiles(smiles_strs, max_cpus=None, quiet=True, smiles_col='rdkit_smiles')[source]

Compute 2D and 3D Mordred descriptors for the given list of SMILES strings.

Args:

smiles_strs: A list or array of SMILES strings

max_cpus: The maximum number of cores to use for computing descriptors. The default value None means
that all available cores will be used.

quiet (bool): If True, suppress displaying a progress indicator while computing descriptors.

smiles_col (str): The name of the column that will contain SMILES strings in the returned data frame.

Returns: tuple
desc_df (DataFrame): A table of Mordred descriptors for the input SMILES strings that were valid (according to RDKit), together with those SMILES strings.
is_valid (ndarray of bool): An array of the same length as smiles_strs, indicating which SMILES strings were considered valid.
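A minimal sketch (the invalid string is included to show how is_valid flags it):

>>> from pipeline.featurization import compute_mordred_descriptors_from_smiles
>>> smiles = ['CCO', 'not_a_smiles', 'c1ccccc1O']
>>> desc_df, is_valid = compute_mordred_descriptors_from_smiles(smiles, max_cpus=1)
>>> is_valid   # False for the SMILES string RDKit could not parse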
pipeline.featurization.compute_rdkit_descriptors_from_smiles(smiles_strs, smiles_col='rdkit_smiles')[source]

Compute 2D and 3D RDKit descriptors for the given list of SMILES strings.

Args:

smiles_strs: A list or array of SMILES strings

smiles_col (str): The name of the column that will contain SMILES strings in the returned data frame.

Returns: tuple
desc_df (DataFrame): A table of RDKit descriptors for the input SMILES strings that were valid (according to RDKit), together with those SMILES strings.
is_valid (ndarray of bool): An array of the same length as smiles_strs, indicating which SMILES strings were considered valid.
pipeline.featurization.create_featurization(params)[source]

Factory method to create the appropriate type of Featurization object for params.featurizer

Args:
params (argparse.Namespace): Object containing the parameter list
Returns:
Featurization object of the correct subclass as specified by params.featurizer
Raises:
ValueError: If params.featurizer not in [‘ecfp’,’graphconv’,’molvae’,’computed_descriptors’,’descriptors’]
pipeline.featurization.featurize_smiles(df, featurizer, smiles_col, log_every_N=1000)[source]

Replacement for DeepChem 2.1 featurize_smiles_df function, which is buggy. Computes features using featurizer for dataframe df column given by smiles_col. Returns them as a numpy array, along with an array ‘is_valid’ indicating which rows of the input dataframe yielded valid features.

pipeline.featurization.get_2d_mols(smiles_strs)[source]

Convert SMILES strings to RDKit Mol objects without explicit hydrogens or 3D coordinates

Args:
smiles_strs (iterable of str): List of SMILES strings to convert
Returns:
tuple (mols, is_valid):
mols (ndarray of Mol): Mol objects for the valid SMILES strings only
is_valid (ndarray of bool): True for each input SMILES string that was valid according to RDKit
pipeline.featurization.get_3d_mols(smiles_strs)[source]

Convert SMILES strings to Mol objects with explicit hydrogens and 3D coordinates

Args:
smiles_strs (iterable of str): List of SMILES strings to convert
Returns:
tuple (mols, is_valid):
mols (ndarray of Mol): Mol objects for the valid SMILES strings only
is_valid (ndarray of bool): True for each input SMILES string that was valid according to RDKit
pipeline.featurization.get_dataset_attributes(dset_df, params)[source]

Construct a table mapping compound IDs to SMILES strings and possibly other attributes (e.g., dates) specified in params.

Args:

dset_df (DataFrame): The dataset table

params (Namespace): Parsed parameters. The id_col and smiles_col parameters are used to specify the columns in dset_df containing compound IDs and SMILES strings, respectively. If the parameter date_col is not None, it is used to specify a column of datetime strings associated with each compound.

Returns:
attr_df (DataFrame): A table of SMILES strings and (optionally) other attributes, indexed by compound_id.
pipeline.featurization.get_mordred_calculator(exclude=['EState', 'MolecularDistanceEdge'], ignore_3D=False)[source]

Create a Mordred calculator with all descriptor modules registered except those whose names are in the exclude list. Register ATOM versions of the classes in those modules instead.

Args:

exclude (list): List of Mordred descriptor modules to exclude.

ignore_3D (bool): Whether to exclude descriptors that require computing 3D structures.

Returns:
calc (mordred.Calculator): Object for performing Mordred descriptor calculations.
pipeline.featurization.get_rdkit_calculator(desc_list)[source]

Create a Mordred calculator with only the RDKit wrapper descriptor modules registered

pipeline.featurization.get_user_specified_features(df, featurizer, verbose=False)[source]

Temp fix for DC 2.3 issue. See https://github.com/deepchem/deepchem/issues/1841

pipeline.featurization.make_weights(vals)[source]

In the case of multitask learning, we must create a weight for each sample, labeling it 0 if the response value is missing and 1 if it is present.

Args:
vals: Numpy array containing NaNs where response values are missing.
Returns:
vals: Numpy array, the same as the input vals except that NaNs are replaced with 0.
w: Numpy array of the same shape as vals, where w[i,j] = 0 if vals[i,j] is NaN and w[i,j] = 1 otherwise.
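A small sketch of the expected behavior:

>>> import numpy as np
>>> from pipeline.featurization import make_weights
>>> vals = np.array([[1.0, np.nan], [np.nan, 0.0]])
>>> vals_out, w = make_weights(vals)
>>> # NaNs in vals_out become 0; w is 0 where the response was missing and 1 where it was present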
pipeline.featurization.remove_duplicate_smiles(dset_df, smiles_col='rdkit_smiles')[source]

Remove any rows with duplicate SMILES strings from the given dataset.

Args:

dset_df (DataFrame): The dataset table.

smiles_col (str): The column containing SMILES strings.

Returns:
filtered_dset_df (DataFrame): The dataset filtered to remove duplicate SMILES strings.
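A small sketch (the column names follow the defaults used elsewhere in the pipeline):

>>> import pandas as pd
>>> from pipeline.featurization import remove_duplicate_smiles
>>> dset_df = pd.DataFrame({'compound_id': ['c1', 'c2', 'c3'],
...                         'rdkit_smiles': ['CCO', 'CCO', 'CCN']})
>>> filtered_df = remove_duplicate_smiles(dset_df)   # rows sharing the duplicated 'CCO' are removed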

pipeline.feature_importance module

Functions to assess feature importance in AMPL models

pipeline.feature_importance.base_feature_importance(model_pipeline=None, params=None)[source]

Minimal baseline feature importance function. Given an AMPL model (or the parameters to train a model), returns a data frame with a row for each feature. The columns of the data frame depend on the model type and prediction type. If the model is a binary classifier, the columns include t-statistics and p-values for the differences between the means of the active and inactive compounds. If the model is a random forest, the columns will include the mean decrease in impurity (MDI) of each feature, taken from the scikit-learn feature_importances_ attribute. See the scikit-learn documentation for warnings about interpreting the MDI importance. For all models, the returned data frame will include feature names, means and standard deviations for each feature.

This function has been tested on RFs and NNs with rdkit descriptors. Other models and feature combinations may not be supported.

Args:

model_pipeline (ModelPipeline): A pipeline object for a model that was trained in the current Python session or loaded from the model tracker or a tarball file. Either model_pipeline or params must be provided.

params (dict): Parameter dictionary for a model to be trained and analyzed. Either model_pipeline or a params argument must be passed; if both are passed, params is ignored and the parameters from model_pipeline are used.

Returns:
(imp_df, model_pipeline, pparams) (tuple):

imp_df (DataFrame): Table of feature importance metrics.

model_pipeline (ModelPipeline): Pipeline object for the model that was passed to or trained by the function.

pparams (Namespace): Parsed parameters of the model.
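A hypothetical call pattern for a model already loaded in the current session; my_pipeline is an illustrative ModelPipeline variable:

    from pipeline import feature_importance as fi

    imp_df, mp, pparams = fi.base_feature_importance(model_pipeline=my_pipeline)
    print(imp_df.head())  # feature names, means, std devs, plus model-specific columns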
pipeline.feature_importance.cluster_permutation_importance(model_pipeline=None, params=None, score_type=None, clust_height=1, result_file=None, nreps=10, nworkers=1)[source]

Divide the input features used in a model into correlated clusters, then assess the importance of the features by iterating over clusters, permuting the values of all the features in the cluster, and measuring the effect on the model performance metric given by score_type for the training, validation and test subsets.

Args:

model_pipeline (ModelPipeline): A pipeline object for a model that was trained in the current Python session or loaded from the model tracker or a tarball file. Either model_pipeline or params must be provided.

params (dict): Parameter dictionary for a model to be trained and analyzed. Either model_pipeline or a params argument must be passed; if both are passed, params is ignored and the parameters from model_pipeline are used.

clust_height (float): Height at which to cut the dendrogram branches to split features into clusters.

result_file (str): Path to a CSV file where a table of features and cluster indices will be written.

nreps (int): Number of repetitions of the permutation and rescoring procedure to perform for each feature; the importance values returned will be averages over repetitions. More repetitions will yield better importance estimates at the cost of greater computing time.

nworkers (int): Number of parallel worker threads to use for permutation and rescoring. Currently ignored; multithreading will be added in a future version.

Returns:
imp_df (DataFrame): Table of feature clusters and importance values
pipeline.feature_importance.display_feature_clusters(model_pipeline=None, params=None, clust_height=1, corr_file=None, show_matrix=False, show_dendro=True)[source]

Cluster the input features used in the model specified by model_pipeline or params, using Spearman correlation as a similarity metric. Display a dendrogram and/or a correlation matrix heatmap, so the user can decide the height at which to cut the dendrogram in order to split the features into clusters, for input to cluster_permutation_importance.

Args:

model_pipeline (ModelPipeline): A pipeline object for a model that was trained in the current Python session or loaded from the model tracker or a tarball file. Either model_pipeline or params must be provided.

params (dict): Parameter dictionary for a model to be trained and analyzed. Either model_pipeline or a params argument must be passed; if both are passed, params is ignored and the parameters from model_pipeline are used.

clust_height (float): Height at which to draw a cut line in the dendrogram, to show how many clusters will be generated.

corr_file (str): Path to an optional CSV file to be created containing the feature correlation matrix.

show_matrix (bool): If True, plot a correlation matrix heatmap.

show_dendro (bool): If True, plot the dendrogram.

Returns:
corr_linkage (np.ndarray): Linkage matrix from correlation clustering
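A sketch of the intended two-step workflow: inspect the dendrogram first, then rerun the clustering at a chosen cut height. The variable names and score_type value are illustrative assumptions:

    from pipeline import feature_importance as fi

    # Step 1: plot the dendrogram to choose a cut height
    corr_linkage = fi.display_feature_clusters(model_pipeline=my_pipeline, clust_height=1)

    # Step 2: assess importance of the clusters produced at that height
    imp_df = fi.cluster_permutation_importance(model_pipeline=my_pipeline,
                                               score_type='r2', clust_height=1,
                                               nreps=10, result_file='clusters.csv')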
pipeline.feature_importance.permutation_feature_importance(model_pipeline=None, params=None, score_type=None, nreps=60, nworkers=1, result_file=None)[source]

Assess the importance of each feature used by a trained model by permuting the values of each feature in succession in the training, validation and test sets, making predictions, computing performance metrics, and measuring the effect of scrambling each feature on a particular metric.

Args:

model_pipeline (ModelPipeline): A pipeline object for a model that was trained in the current Python session or loaded from the model tracker or a tarball file. Either model_pipeline or params must be provided.

params (dict): Parameter dictionary for a model to be trained and analyzed. Either model_pipeline or a params argument must be passed; if both are passed, params is ignored and the parameters from model_pipeline are used.

score_type (str): Name of the scoring metric to use to assess importance. This can be any of the standard values supported by sklearn.metrics.get_scorer; the AMPL-specific values ‘npv’, ‘mcc’, ‘kappa’, ‘mae’, ‘rmse’, ‘ppv’, ‘cross_entropy’, ‘bal_accuracy’ and ‘avg_precision’ are also supported. Score types for which smaller values are better, such as ‘mae’, ‘rmse’ and ‘cross_entropy’ are mapped to their negative counterparts.

nreps (int): Number of repetitions of the permutation and rescoring procedure to perform for each feature; the importance values returned will be averages over repetitions. More repetitions will yield better importance estimates at the cost of greater computing time.

nworkers (int): Number of parallel worker threads to use for permutation and rescoring.

result_file (str): Optional path to a CSV file to which the importance table will be written.

Returns:
imp_df (DataFrame): Table of features and importance metrics. The table will include the columns returned by base_feature_importance, along with the permutation importance scores for each feature for the training, validation and test subsets.
pipeline.feature_importance.plot_feature_importances(imp_df, importance_col='valid_perm_importance_mean', max_feat=20, ascending=False)[source]

Display a horizontal bar plot showing the relative importances of the most important features or feature clusters, according to the results of permutation_feature_importance, cluster_permutation_importance or a similar function.

Args:

imp_df (DataFrame): Table of results from permutation_feature_importance, cluster_permutation_importance, base_feature_importance or a similar function.

importance_col (str): Name of the column in imp_df to plot values from.

max_feat (int): The maximum number of features or feature clusters to plot values for.

ascending (bool): Should the features be ordered by ascending values of importance_col? Defaults to False; can be set True for p-values or something else where small values mean greater importance.

Returns:
None
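An illustrative end-to-end call for a trained classification model; the score_type value and the my_pipeline variable are assumptions, while importance_col follows the documented default:

    from pipeline import feature_importance as fi

    imp_df = fi.permutation_feature_importance(model_pipeline=my_pipeline,
                                               score_type='roc_auc', nreps=60,
                                               result_file='perm_importance.csv')
    fi.plot_feature_importances(imp_df, importance_col='valid_perm_importance_mean',
                                max_feat=20)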

pipeline.model_datasets module

Classes for dealing with datasets for data-driven modeling.

class pipeline.model_datasets.DatastoreDataset(params, featurization=None, ds_client=None)[source]

Bases: pipeline.model_datasets.ModelDataset

Subclass representing a dataset for data-driven modeling that lives in the datastore.

Attributes:

set in __init__:

params (Namespace object): contains all parameter information

log (logger object): logger for all warning messages.

dataset_name (str): set from the parameter object, the name of the dataset

output_dir (str): The root directory for saving output files

split_strategy (str): the flag for determining the split strategy (e.g. ‘train_test_valid’,’k-fold’)

featurization (Featurization object): The featurization object created by ModelDataset or input as an optional argument in the factory function.

splitting (Splitting object): A splitting object created by the ModelDataset initialization method

combined_train_valid_data (dc.DiskDataset): A dataset object (initialized as None) of the merged train and valid splits

ds_client (datastore client):

set in get_featurized_data:

dataset: A new featurized DeepChem DiskDataset.

n_features: The count of features (int)

vals: The response col after featurization (np.array)

attr: A pd.DataFrame containing the compound IDs and SMILES strings

set in get_dataset_tasks:
tasks (list): list of prediction task columns
set in split_dataset or load_presplit_dataset:

train_valid_dsets: A list of tuples of (training,validation) DeepChem Datasets

test_dset: (dc.data.Dataset): The test dataset to be held out

train_valid_attr: A list of tuples of (training,validation) attribute DataFrames

test_attr: The attribute DataFrame for the test set, containing compound IDs and SMILES strings.

set in load_full_dataset()
dataset_key (str): The datastore key pointing to the dataset
get_dataset_tasks(dset_df)[source]

Sets self.tasks to the list of prediction task columns defined for this dataset. If the dataset is in the datastore, these should be available in the metadata. Otherwise we guess by looking at the column names in dset_df and excluding features, compound IDs, SMILES string columns, etc.

Args:
dset_df (pd.DataFrame): Dataset containing the prediction tasks
Returns:
Success (bool): Returns true if task names are retrieved.
Side effects:
Sets the task attribute of the DatastoreDataset object to a list of task names.
load_dataset_split_table(directory=None)[source]

Loads from the datastore a table of compound IDs assigned to each split subset of a dataset. Called by load_presplit_dataset().

Args:
directory: Ignored; included only for compatibility with the FileDataset version of this method.
Returns:
tuple(split_df, split_kv):

split_df (DataFrame): Table assigning compound IDs to split subsets and folds.

split_kv (dict): Dictionary of key-value pairs from the split table metadata; includes all the parameters that were used to define the split.
load_featurized_data()[source]

Loads prefeaturized data from the datastore. Returns a data frame, which is then passed to featurization.extract_prefeaturized_data() for processing.

Returns:
featurized_dset_df (pd.DataFrame): dataframe of the prefeaturized data; needs further processing
load_full_dataset()[source]

Loads the dataset from the datastore

Returns:
dset_df: Dataset as a DataFrame
Raises:
Exception if dset_df is None or empty due to an error in loading the dataset.
save_featurized_data(featurized_dset_df)[source]

Save a featurized dataset to the datastore

Args:
featurized_dset_df: DataFrame containing the featurized dataset.
Returns:
None
save_split_dataset(directory=None)[source]

Saves a table of compound IDs assigned to each split subset of a dataset.

Args:
directory: Ignored; included only for compatibility with the FileDataset version of this method.
class pipeline.model_datasets.FileDataset(params, featurization)[source]

Bases: pipeline.model_datasets.ModelDataset

Subclass representing a dataset for data-driven modeling that lives in the filesystem.

Attributes:

set in __init__:

params (Namespace object): contains all parameter information

log (logger object): logger for all warning messages.

dataset_name (str): set from the parameter object, the name of the dataset

output_dir (str): The root directory for saving output files

split_strategy (str): the flag for determining the split strategy (e.g. ‘train_test_valid’,’k-fold’)

featurization (Featurization object): The featurization object created by ModelDataset or input as an optional argument in the factory function.

splitting (Splitting object): A splitting object created by the ModelDataset initialization method

combined_train_valid_data (dc.DiskDataset): A dataset object (initialized as None), of the merged train and valid splits

ds_client (datastore client):

set in get_featurized_data:

dataset: A new featurized DeepChem DiskDataset.

n_features: The count of features (int)

vals: The response col after featurization (np.array)

attr: A pd.DataFrame containing the compound IDs and SMILES strings

set in get_dataset_tasks:
tasks (list): list of prediction task columns
set in split_dataset or load_presplit_dataset:

train_valid_dsets: A list of tuples of (training,validation) DeepChem Datasets

test_dset: (dc.data.Dataset): The test dataset to be held out

train_valid_attr: A list of tuples of (training,validation) attribute DataFrames

test_attr: The attribute DataFrame for the test set, containing compound IDs and SMILES strings.

get_dataset_tasks(dset_df)[source]

Returns the list of prediction task columns defined for this dataset. If the dataset is in the datastore, these should be available in the metadata. Otherwise we guess by looking at the column names in dset_df and excluding features, compound IDs, SMILES string columns, etc.

Args:
dset_df (pd.DataFrame): Dataset as a DataFrame that contains columns for the prediction tasks
Returns:
success (boolean): True if self.tasks is set. False if the tasks were not supplied by the user.
Side effects:
Sets the self.tasks attribute of FileDataset to be the list of prediction task columns
load_dataset_split_table(directory=None)[source]

Loads from the filesystem a table of compound IDs assigned to each split subset of a dataset. Called by load_presplit_dataset().

Args:
directory (str): Directory where the split table is stored. Defaults to the directory of the current dataset.
Returns:
split_df (DataFrame): Table assigning compound IDs to split subsets and folds.

split_kv: None for the FileDataset version of this method.
load_featurized_data()[source]

Loads prefeaturized data from the filesystem. Returns a data frame, which is then passed to featurization.extract_prefeaturized_data() for processing.

Returns:
featurized_dset_df (pd.DataFrame): dataframe of the prefeaturized data; needs further processing
load_full_dataset()[source]

Loads the dataset from the file system.

Returns:
dset_df: Dataset as a DataFrame loaded in from a CSV or feather file
Raises:
Exception: if the dataset is empty or failed to load
save_featurized_data(featurized_dset_df)[source]

Save a featurized dataset to the filesystem.

Args:
featurized_dset_df (pd.DataFrame): Dataset as a DataFrame that contains the featurized data
save_split_dataset(directory=None)[source]

Saves a table of compound IDs and split subset assignments for the current dataset.

Args:
directory (str): Directory where the split table will be created. Defaults to the directory of the current dataset.
class pipeline.model_datasets.MinimalDataset(params, featurization, contains_responses=False)[source]

Bases: pipeline.model_datasets.ModelDataset

A lightweight dataset class that does not support persistence or splitting, and therefore can be used for predictions with an existing model, but not for training a model. Is not expected to contain response columns, i.e. the ground truth is assumed to be unknown.

Attributes:

set in __init__:
params (Namespace object): contains all parameter information

log (logger object): logger for all warning messages.

featurization (Featurization object): The featurization object created by ModelDataset or input as an optional argument in the factory function.
set in get_featurized_data:

dataset: A new featurized DeepChem DiskDataset.

n_features: The count of features (int)

attr: A pd.DataFrame containing the compound IDs and SMILES strings
set in get_dataset_tasks:
tasks (list): list of prediction task columns
get_dataset_tasks(dset_df)[source]

Sets self.tasks to the list of prediction task columns defined for this dataset. These should be defined in the params.response_cols list that was provided when this object was created.

Args:
dset_df (pd.DataFrame): Ignored in this version.
Returns:
Success (bool): Returns true if task names are retrieved.
Side effects:
Sets the task attribute of the MinimalDataset object to a list of task names.
get_featurized_data(dset_df, is_featurized=False)[source]

Featurizes the compound data provided in data frame dset_df, and creates an associated DeepChem Dataset object.

Args:

dset_df (DataFrame): DataFrame containing either compound IDs and SMILES strings or a feature matrix

is_featurized (Boolean): boolean specifying whether the dset_df is already featurized

Returns:
None
Side effects:

Sets the following attributes in the ModelDataset object:

dataset: A new featurized DeepChem Dataset.

n_features: The count of features (int)

attr: A pd.DataFrame containing the compound IDs and SMILES strings

save_featurized_data(featurized_dset_df)[source]

Does nothing, since a MinimalDataset object does not persist its data.

Args:
featurized_dset_df (pd.DataFrame): Ignored.
class pipeline.model_datasets.ModelDataset(params, featurization)[source]

Bases: object

Base class representing a dataset for data-driven modeling. Subclasses are specialized for dealing with dataset objects persisted in the datastore or in the filesystem.

Attributes:

set in __init__:

params (Namespace object): contains all parameter information

log (logger object): logger for all warning messages.

dataset_name (str): set from the parameter object, the name of the dataset

output_dir (str): The root directory for saving output files

split_strategy (str): the flag for determining the split strategy (e.g. ‘train_test_valid’,’k-fold’)

featurization (Featurization object): The featurization object created by ModelDataset or input as an argument in the factory function.

splitting (Splitting object): A splitting object created by the ModelDataset initialization method

combined_train_valid_data (dc.DiskDataset): A dataset object (initialized as None), of the merged train and valid splits

set in get_featurized_data:

dataset: A new featurized DeepChem DiskDataset.

n_features: The count of features (int)

vals: The response col after featurization (np.array)

attr: A pd.DataFrame containing the compound IDs and SMILES strings

set in get_dataset_tasks:
tasks (list): list of prediction task columns
set in split_dataset or load_presplit_dataset:

train_valid_dsets: A list of tuples of (training,validation) DeepChem Datasets

test_dset: (dc.data.Dataset): The test dataset to be held out

train_valid_attr: A list of tuples of (training,validation) attribute DataFrames

test_attr: The attribute DataFrame for the test set, containing compound IDs and SMILES strings.

check_task_columns(dset_df)[source]

Check that the data frame dset_df contains columns for the requested prediction tasks.

Args:
dset_df (pd.DataFrame): Dataset as a DataFrame that contains columns for the prediction tasks
Raises:
Exception:
If self.get_dataset_tasks(dset_df) cannot retrieve prediction tasks for the dataset, or if prediction task columns are missing from the input dataset.
combined_training_data()[source]

Returns a DeepChem Dataset object containing data for the combined training & validation compounds.

Returns:
combined_dataset (dc.data.DiskDataset): Dataset containing the combined training and validation data.
Side effects:
Overwrites the combined_train_valid_data attribute of the ModelDataset with the combined data
create_dataset_split_table()[source]

Generates a data frame containing the information needed to reconstruct the current train/valid/test or k-fold split.

Returns:
split_df (DataFrame): Table with one row per compound in the dataset, with columns:

cmpd_id: Compound ID

subset: The subset the compound was assigned to in the split. Either ‘train’, ‘valid’, ‘test’, or ‘train_valid’. ‘train_valid’ is used for a k-fold split to indicate that the compound was rotated between training and validation sets.

fold: For a k-fold split, an integer indicating the fold in which the compound was in the validation set. Is zero for compounds in the test set and for all compounds when a train/valid/test split was used.
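For illustration, a sketch of inspecting the resulting table; my_dataset is an illustrative ModelDataset instance that has already been split:

    split_df = my_dataset.create_dataset_split_table()
    print(split_df.columns.tolist())          # ['cmpd_id', 'subset', 'fold']
    print(split_df['subset'].value_counts())  # sizes of the split subsets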
get_dataset_tasks(dset_df)[source]

Sets self.tasks to the list of prediction task columns defined for this dataset. If the dataset is in the datastore, these should be available in the metadata. Otherwise we guess by looking at the column names in dset_df and excluding features, compound IDs, SMILES string columns, etc.

Args:
dset_df (pd.DataFrame): Dataset as a DataFrame that contains columns for the prediction tasks
Returns:
success (boolean): True if self.tasks is set. False if the tasks were not supplied by the user.
Side effects:
Sets the self.tasks attribute to be the list of prediction task columns
get_featurized_data()[source]

Does whatever is necessary to prepare a featurized dataset. Loads an existing prefeaturized dataset if one exists and if parameter previously_featurized is set True; otherwise loads a raw dataset and featurizes it. Creates an associated DeepChem Dataset object.

Side effects:
Sets the following attributes in the ModelDataset object:
dataset: A new featurized DeepChem DiskDataset.

n_features: The count of features (int)

vals: The response col after featurization (np.array)

attr: A pd.DataFrame containing the compound IDs and SMILES strings
get_split_metadata()[source]

Creates a dictionary of the parameters related to dataset splitting, to be saved in the model tracker along with the other metadata needed to reproduce a model run.

Returns:
dict: A dictionary containing the data needed to reproduce the current dataset training/validation/test splits, including the lists of compound IDs for each split subset
get_subset_responses_and_weights(subset, transformers)[source]

Returns a dictionary mapping compound IDs in the given dataset subset to arrays of response values and weights. Used by the perf_data module under k-fold CV.

Args:

subset (string): Label of subset, ‘train’, ‘test’, or ‘valid’

transformers: Transformers object for full dataset

Returns:
tuple(response_dict, weight_dict):

response_dict (dict): dictionary mapping compound IDs to arrays of per-task untransformed response values

weight_dict (dict): dictionary mapping compound IDs to arrays of per-task weights
has_all_feature_columns(dset_df)[source]

Compare the columns in dataframe dset_df against the feature columns required by the current featurization and descriptor_type param. Returns True if dset_df contains all the required columns.

Args:
dset_df (DataFrame): Feature matrix
Returns:
(Boolean): True if dset_df contains all the feature columns required by the current featurization; False if any are missing
load_featurized_data()[source]

Loads prefeaturized data from the datastore or filesystem. Returns a data frame, which is then passed to featurization.extract_prefeaturized_data() for processing.

Raises:
NotImplementedError: The method is implemented by subclasses
load_full_dataset()[source]

Loads the dataset from the datastore or the file system.

Raises:
NotImplementedError: The method is implemented by subclasses
load_presplit_dataset(directory=None)[source]

Loads a table of compound IDs assigned to split subsets, and uses them to split the currently loaded featurized dataset.

Args:

directory (str): Optional directory where the split table is stored; used only by FileDataset. Defaults to the directory containing the current dataset.

Returns:

success (boolean): True if the split table was loaded successfully and used to split the dataset.

Side effects:
Sets the following attributes of the ModelDataset object

train_valid_dsets: A list of tuples of (training,validation) DeepChem Datasets

test_dset: (dc.data.Dataset): The test dataset to be held out

train_valid_attr: A list of tuples of (training,validation) attribute DataFrames

test_attr: The attribute DataFrame for the test set, containing compound IDs and SMILES strings.

Raises:
Exception: Catches exceptions from split.select_dset_by_attr_ids or from other errors while splitting the dataset using metadata
split_dataset()[source]

Splits the dataset into paired training/validation and test subsets, according to the split strategy selected by the model params. For traditional train/valid/test splits, there is only one training/validation pair. For k-fold cross-validation splits, there are k different train/valid pairs; the validation sets are disjoint but the training sets overlap.

Side effects:
Sets the following attributes in the ModelDataset object:

train_valid_dsets: A list of tuples of (training,validation) DeepChem Datasets

test_dset: (dc.data.Dataset): The test dataset to be held out

train_valid_attr: A list of tuples of (training,validation) attribute DataFrames

test_attr: The attribute DataFrame for the test set, containing compound IDs and SMILES strings.

pipeline.model_datasets.create_minimal_dataset(params, featurization, contains_responses=False)[source]

Create a MinimalDataset object for non-persistent data (e.g., a list of compounds or SMILES strings or a data frame). This object will be suitable for running predictions on a pretrained model, but not for training.

Args:

params (Namespace object): contains all parameter information.

featurization (Featurization object): The featurization object created by ModelDataset or input as an argument in the factory function.

contains_responses (Boolean): Boolean specifying whether the dataset has a column with response values

Returns:
(MinimalDataset): a new MinimalDataset object
pipeline.model_datasets.create_model_dataset(params, featurization, ds_client=None)[source]

Factory function for creating DatastoreDataset or FileDataset objects.

Args:

params (Namespace object): contains all parameter information.

featurization (Featurization object): The featurization object created by ModelDataset or input as an argument in the factory function.

ds_client (Datastore client)

Returns:
either (DatastoreDataset) or (FileDataset): instantiated ModelDataset subclass specified by params
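A hypothetical end-to-end setup for a filesystem dataset; the parameter dictionary keys and the create_featurization call follow the pipeline’s usual pattern but are assumptions here, not a verified configuration:

    from pipeline import parameter_parser, featurization as feat, model_datasets

    params = parameter_parser.wrapper({'dataset_key': 'my_dataset.csv',
                                       'featurizer': 'ecfp',
                                       'response_cols': 'pIC50'})
    featurization = feat.create_featurization(params)
    data = model_datasets.create_model_dataset(params, featurization)  # FileDataset here
    data.get_featurized_data()
    data.split_dataset()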
pipeline.model_datasets.create_split_dataset_from_metadata(model_metadata, ds_client, save_file=False)[source]

Function that pulls the split metadata from the datastore and then joins that info with the dataset itself.

Args:

model_metadata (Namespace): Namespace object of model metadata

ds_client: datastore client

save_file (Boolean): Boolean specifying whether we want to save split dataset to disk

Returns:
(DataFrame): DataFrame with subset information and response column
pipeline.model_datasets.key_value_list_to_dict(kvp_list)[source]

Convert a key-value pair list from the datastore metadata into a proper dictionary

Args:
kvp_list (list): List of key-value pairs
Returns:
(dictionary): the kvp-list reformatted as a dictionary
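A minimal sketch of the conversion, assuming the datastore’s key-value pairs are stored as a list of {'key': ..., 'value': ...} dictionaries (the layout is an assumption for this example):

    kvp_list = [{'key': 'splitter', 'value': 'scaffold'},
                {'key': 'split_valid_frac', 'value': 0.15}]
    kv_dict = {kvp['key']: kvp['value'] for kvp in kvp_list}
    # {'splitter': 'scaffold', 'split_valid_frac': 0.15}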
pipeline.model_datasets.save_joined_dataset(joined_dataset, split_metadata)[source]

DEPRECATED: Refers to absolute file paths that no longer exist.

Args:

joined_dataset (DataFrame): DataFrame containing split information with the response column

split_metadata (dictionary): Dictionary containing metadata with split info

Returns:
None
pipeline.model_datasets.set_group_permissions(system, path, data_owner='public', data_owner_group='public')[source]

Set file group and permissions to standard values for a dataset containing proprietary or public data, as indicated by ‘data_owner’.

Args:

system (string): Determines the group ownership (at the moment ‘LC’ or ‘AD’)

path (string): File path

data_owner (string): Who the data belongs to, either ‘public’ or the name of a company (e.g. ‘gsk’) associated with a restricted access group. Special values:

‘username’: group is set to the current user’s username

‘data_owner_group’: group is set to data_owner_group

Otherwise, the group is set from a hard-coded dictionary.
Returns:
None

pipeline.model_pipeline module

pipeline.model_tracker module

Module to interface model pipeline to model tracker service.

exception pipeline.model_tracker.DatastoreInsertionException[source]

Bases: Exception

exception pipeline.model_tracker.MLMTClientInstantiationException[source]

Bases: Exception

pipeline.model_tracker.convert_metadata(old_metadata)[source]

Convert model metadata from old format (with camel-case parameter group names) to new format.

Args:
old_metadata (dict): Model metadata in old format
Returns:
new_metadata (dict): Model metadata in new format
pipeline.model_tracker.export_model(model_uuid, collection, model_dir, alt_bucket='CRADA')[source]

Export the metadata (parameters) and other files needed to recreate a model from the model tracker database to a gzipped tar archive.

Args:

model_uuid (str): Model unique identifier

collection (str): Name of the collection holding the model in the database.

model_dir (str): Path to directory where the model metadata and parameter files will be written. The directory will be created if it doesn’t already exist. Subsequently, the directory contents will be packed into a gzipped tar archive named model_dir.tar.gz.

alt_bucket (str): Alternate datastore bucket to search for model tarball and transformer objects.

Returns:
None
pipeline.model_tracker.extract_datastore_model_tarball(model_uuid, model_bucket, output_dir, model_dir)[source]

Load a model tarball saved in the datastore and check the format. If it is a new style tarball (containing the model metadata and transformers along with the model state), unpack it into output_dir. Otherwise it contains the model state only; unpack it into model_dir.

Args:

model_uuid (str): UUID of model to be retrieved

model_bucket (str): Datastore bucket containing model tarball file

output_dir (str): Output directory to unpack tarball into if it’s in the new format

model_dir (str): Output directory to unpack tarball into if it’s in the old format

Returns:
extract_dir (str): The directory (output_dir or model_dir) the tarball was extracted into.
pipeline.model_tracker.get_full_metadata(filter_dict, collection_name=None)[source]

Retrieve relevant full metadata (including training run metrics) of models matching given criteria.

Args:

filter_dict (dict): dictionary to filter on

collection_name (str): Name of collection to search

Returns:
A list of matching full model metadata (including training run metrics) dictionaries. Raises MongoQueryException if the query fails.
pipeline.model_tracker.get_full_metadata_by_uuid(model_uuid, collection_name=None)[source]

Retrieve model parameter metadata for the given model_uuid and collection. The returned metadata dictionary will include training run performance metrics and training dataset metadata.

Args:

model_uuid (str): model unique identifier

collection_name(str): collection to search (optional, searches all collections if not specified)

Returns:
Matching metadata dictionary. Raises MongoQueryException if the query fails.
pipeline.model_tracker.get_metadata_by_uuid(model_uuid, collection_name=None)[source]

Retrieve model parameter metadata by model_uuid. The resulting metadata dictionary can be passed to parameter_parser.wrapper(); it does not contain performance metrics or training dataset metadata.

Args:

model_uuid (str): model unique identifier

collection_name(str): collection to search (optional, searches all collections if not specified)

Returns:
Matching metadata dictionary. Raises MongoQueryException if the query fails.
pipeline.model_tracker.get_model_collection_by_uuid(model_uuid, mlmt_client=None)[source]

Retrieve model collection given a uuid.

Args:

model_uuid (str): model uuid

mlmt_client: Ignored

Returns:
Matching collection name
Raises:
ValueError if there is no collection containing a model with the given uuid.
pipeline.model_tracker.get_model_training_data_by_uuid(uuid)[source]

Retrieve data used to train, validate, and test a model given the uuid

Args:
uuid (str): model uuid
Returns:
a tuple of dataframes containing the training, validation, and test data, including the compound ID, RDKit SMILES, and response value
pipeline.model_tracker.save_model(pipeline, collection_name='model_tracker', log=True)[source]

Save the model.

Save the model files to the datastore and save the model metadata dict to the Mongo database.

Args:

pipeline (ModelPipeline object): the pipeline to use

collection_name (str): the name of the Mongo DB collection to use

log (bool): True if logs should be printed; default True

use_personal_client (bool): True if personal client should be used (i.e. for testing), default False

Returns:
None if insertion was successful, raises DatastoreInsertionException, MLMTClientInstantiationException or MongoInsertionException otherwise
pipeline.model_tracker.save_model_tarball(output_dir, model_tarball_path)[source]

Save the model parameters, metadata and transformers as a portable gzipped tar archive.

Args:

output_dir (str): Output directory from model training

model_tarball_path (str): Path of tarball file to be created

Returns:
None

pipeline.model_wrapper module

pipeline.parameter_parser module

pipeline.perf_data module

Contains class PerfData and its subclasses, which are objects for collecting and computing model performance metrics and predictions

class pipeline.perf_data.ClassificationPerfData(model_dataset, subset)[source]

Bases: pipeline.perf_data.PerfData

Class with methods for accumulating classification model prediction data over multiple cross-validation folds and computing performance metrics after all folds have been run. Abstract class with concrete subclasses for different split strategies.

Attributes:
set in __init__

num_tasks (int): Set to None, the number of tasks

num_cmpds (int): Set to None, the number of compounds

num_classes (int): Set to None, the number of classes

accumulate_preds(predicted_vals, ids, pred_stds=None)[source]
Raises:
NotImplementedError: The method is implemented by subclasses
get_pred_values()[source]
Raises:
NotImplementedError: The method is implemented by subclasses
get_prediction_results()[source]

Returns a dictionary of performance metrics for a classification model. The dictionary values will contain only primitive Python types, so that it can be easily JSONified.

Args:
per_task (bool): True if calculating per-task metrics, False otherwise.
Returns:
pred_results (dict): dictionary of performance metrics for a classification model.
model_choice_score(score_type='roc_auc')[source]

Computes a score function based on the accumulated predicted values, to be used for selecting the best training epoch and other hyperparameters.

Args:
score_type (str): The name of the scoring metric to be used, e.g. ‘roc_auc’, ‘precision’,
‘recall’, ‘f1’; see https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter and sklearn.metrics.SCORERS.keys() for a complete list of options. Larger values of the score function indicate better models.
Returns:
score (float): A score function value. For multitask models, this will be averaged
over tasks.
class pipeline.perf_data.EpochManager(wrapper, subsets={'test': 'test', 'train': 'train', 'valid': 'valid'}, **kwargs)[source]

Bases: object

Manages lists of PerfDatas

This class manages lists of PerfDatas as well as variables related to iteratively training a model over several epochs. This class sets several variables in a given ModelWrapper for the sake of backwards compatibility.

Attributes:
Set in __init__:
_subsets (dict): Must contain the keys ‘train’, ‘valid’, ‘test’. The values
are used as subsets when calling create_perf_data.

_model_choice_score_type (str): Passed into PerfData.model_choice_score

_log (logger): The logger from wrapper.log

_should_stop (bool): True when training has satisfied the stopping conditions: either
it has reached the max number of epochs or it has exceeded early_stopping_patience

wrapper (ModelWrapper): The model wrapper where this object is being used.

_new_best_valid_score (function): This function takes no arguments and is called
whenever a new best validation score is achieved.
accumulate(ei, subset, dset)[source]

Accumulate predictions

Makes predictions, accumulates them, and calculates the performance metric. Calls PerfData.accumulate_preds belonging to the epoch, subset, and given dataset.

Args:

ei (int): Epoch index

subset (str): Which subset, should be train, valid, or test.

dset (dc.data.Dataset): Calculates the performance for the given dset

Returns:
float: Performance metric for the given dset.
compute(ei, subset)[source]

Computes performance metrics

This calls PerfData.compute_perf_metrics and saves the result in f’{subset}_epoch_perfs’

Args:

ei (int): Epoch index

subset (str): Which subset to compute_perf_metrics. Should be train, valid, or test

Returns:
None
on_new_best_valid(functional)[source]

Sets the function called when a new best validation score is achieved

Saves the function called when there’s a new best validation score.

Args:
functional (function): This function takes no arguments and returns nothing. This
function is called when there’s a new best validation score. This can be used to tell the ModelWrapper to save the model.
Returns:
None
Side effect:
Saves the _new_best_valid_score function.
set_make_pred(functional)[source]

Sets the function used to make predictions

Sets the function used to make predictions. This must be called before invoking self.update and self.accumulate

Args:
functional (function): This function takes one argument, a dc.data.Dataset, and
returns an array of predictions for that dset. This function is called when updating the training state after a given epoch.
Returns:
None
Side effects:
Saves the functional as self._make_pred
should_stop()[source]

Returns True when the training loop should stop

Returns:
bool: True when the training loop should stop
update(ei, subset, dset=None)[source]

Update training state

Updates the training state for a given subset and epoch index with the given dataset.

Args:

ei (int): Epoch index.

subset (str): Should be train, valid, test

dset (dc.data.Dataset): Updates using this dset

Returns:
perf (float): the performance of the given dset.
update_epoch(ei, train_dset=None, valid_dset=None, test_dset=None)[source]

Update training state after an epoch

This function updates train/valid/test_perf_data. Call this function once per epoch. Call self.should_stop() after calling this function to see if you should exit the training loop.

Subsets with None arguments will be ignored

Args:

ei (int): The epoch index

train_dset (dc.data.Dataset): The train dataset

valid_dset (dc.data.Dataset): The valid dataset. Providing this argument updates
best_valid_score and _should_stop

test_dset (dc.data.Dataset): The test dataset

Returns:
list: A list of performance values for the provided datasets.
Side effects
This function updates self._should_stop
update_valid(ei)[source]

Checks validation score

Checks validation performance of the given epoch index. Updates self._should_stop, checks on early stopping conditions, calls self._new_best_valid_score() when necessary.

Args:
ei (int): Epoch index
Returns:
None
Side effects
Updates self._should_stop when it’s time to exit the training loop.
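For illustration, a minimal sketch of the epoch loop these methods support; em is an EpochManager created by the ModelWrapper, and the wrapper methods, params.max_epochs, and dataset variables are illustrative assumptions:

    em.set_make_pred(lambda dset: wrapper.generate_predictions(dset))  # assumed predict fn
    em.on_new_best_valid(lambda: wrapper.save_model())                 # assumed save hook

    for ei in range(params.max_epochs):
        wrapper.train_one_epoch(train_dset)   # illustrative training step
        em.update_epoch(ei, train_dset=train_dset, valid_dset=valid_dset,
                        test_dset=test_dset)
        if em.should_stop():
            break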
class pipeline.perf_data.EpochManagerKFold(wrapper, subsets={'test': 'test', 'train': 'train', 'valid': 'valid'}, **kwargs)[source]

Bases: pipeline.perf_data.EpochManager

This class manages the training state when using KFold cross validation. This is necessary because this manager uses f’{subset}_epoch_perf_stds’ unlike EpochManager

compute(ei, subset)[source]

Calls PerfData.compute_perf_metrics()

This differs from EpochManager.compute in that it saves the results into f’{subset}_epoch_perf_stds’

Args:

ei (int): Epoch index

subset (str): Should be train, valid, test.

Returns:
None
class pipeline.perf_data.HybridPerfData(model_dataset, subset)[source]

Bases: pipeline.perf_data.PerfData

Class with methods for accumulating regression model prediction data over multiple cross-validation folds and computing performance metrics after all folds have been run. Abstract class with concrete subclasses for different split strategies.

Attributes:
set in __init__

num_tasks (int): Set to None, the number of tasks

num_cmpds (int): Set to None, the number of compounds

accumulate_preds(predicted_vals, ids, pred_stds=None)[source]
Raises:
NotImplementedError: The method is implemented by subclasses
compute_perf_metrics(per_task=False)[source]
Raises:
NotImplementedError: The method is implemented by subclasses
get_pred_values()[source]
Raises:
NotImplementedError: The method is implemented by subclasses
get_prediction_results()[source]

Returns a dictionary of performance metrics for a regression model. The dictionary values should contain only primitive Python types, so that it can be easily JSONified.

Args:
per_task (bool): True if calculating per-task metrics, False otherwise.
Returns:
pred_results (dict): dictionary of performance metrics for a regression model.
model_choice_score(score_type='r2')[source]

Computes a score function based on the accumulated predicted values, to be used for selecting the best training epoch and other hyperparameters.

Args:
score_type (str): The name of the scoring metric to be used, e.g. ‘r2’, ‘mae’, ‘rmse’
Returns:
score (float): A score function value. For multitask models, this will be averaged
over tasks.
class pipeline.perf_data.KFoldClassificationPerfData(model_dataset, transformers, subset, predict_probs=True, transformed=True)[source]

Bases: pipeline.perf_data.ClassificationPerfData

Class with methods for accumulating classification model performance data over multiple cross-validation folds and computing performance metrics after all folds have been run.

Attributes:
Set in __init__:
subset (str): Label of the type of subset of dataset for tracking predictions

num_cmpds (int): The number of compounds in the dataset

num_tasks (int): The number of tasks in the dataset

pred_vals (dict): The dictionary of prediction results

folds (int): Initialized at zero, flag for determining which k-fold is being assessed

transformers (list of Transformer objects): from input arguments

real_vals (dict): The dictionary containing the original response column values

class_names (np.array): Assumes the classes are of deepchem index type (e.g. 0,1,2,…)

num_classes (int): The number of classes to predict on
accumulate_preds(predicted_vals, ids, pred_stds=None)[source]

Add training, validation or test set predictions from the current fold to the data structure where we keep track of them.

Args:

predicted_vals (np.array): Array of the predicted values for the current dataset

ids (np.array): An np.array of compound ids for the current dataset

pred_stds (np.array): An array of the standard deviation in the predictions, not used in this method

Returns:
None
Side effects:

Overwrites the attribute pred_vals

Increments folds by 1

compute_perf_metrics(per_task=False)[source]

Computes the ROC AUC metrics for each task based on the accumulated values, averaged over training folds, along with standard deviations of the scores. If per_task is False, the scores are averaged over tasks and the overall standard deviation is reported instead.

Args:
per_task (bool): True if calculating per-task metrics, False otherwise.
Returns:

A tuple (roc_auc_mean, roc_auc_std):

roc_auc_mean: A numpy array of mean ROC AUC scores for each task, averaged over folds, if per_task is True.

Otherwise, a float giving the ROC AUC score averaged over both folds and tasks.
roc_auc_std: A numpy array of standard deviations over folds of ROC AUC values, if per_task is True.
Otherwise, a float giving the overall standard deviation.
get_pred_values()[source]

Returns the predicted values accumulated over training, with any transformations undone. If self.subset is ‘train’, ‘train_valid’ or ‘test’, the function will return the means and standard deviations of the class probabilities over the training folds for each compound, for each task. Otherwise, returns a single set of predicted probabilities for each validation set compound. For all subsets, returns the compound IDs and the most probable classes for each task.

Returns:

ids (list): list of compound IDs.

pred_classes (np.array): an (ncmpds, ntasks) array of predicted classes.

class_probs (np.array): a (ncmpds, ntasks, nclasses) array of predicted probabilities for the classes, and

prob_stds (np.array): a (ncmpds, ntasks, nclasses) array of standard errors over folds for the class probability estimates (only available for the ‘train’ and ‘test’ subsets; None otherwise).

get_real_values(ids=None)[source]

Returns the real dataset response values as an (ncmpds, ntasks, nclasses) array of indicator bits (if nclasses > 2) or an (ncmpds, ntasks) array of binary classes (if nclasses == 2), with compound IDs in the same order as in the return from get_pred_values() (unless ids is specified).

Args:
ids (list of str): Optional list of compound IDs to return values for.
Returns:
np.array of shape (ncmpds, ntasks, nclasses) of indicator bits, or a 2D (ncmpds, ntasks) array of binary classes
get_weights(ids=None)[source]

Returns the dataset response weights, as an (ncmpds, ntasks) array in the same ID order as get_pred_values() (unless ids is specified).

Args:
ids (list of str): Optional list of compound IDs to return values for.
Returns:
np.array (ncmpds, ntasks) of the real dataset response weights, in the same ID order as get_pred_values().
class pipeline.perf_data.KFoldRegressionPerfData(model_dataset, transformers, subset, transformed=True)[source]

Bases: pipeline.perf_data.RegressionPerfData

Class with methods for accumulating regression model prediction data over multiple cross-validation folds and computing performance metrics after all folds have been run.

Arguments:
Set in __init__:

subset (str): Label of the type of subset of dataset for tracking predictions

num_cmpds (int): The number of compounds in the dataset

num_tasks (int): The number of tasks in the dataset

pred_vals (dict): The dictionary of prediction results

folds (int): Initialized at zero, flag for determining which k-fold is being assessed

transformers (list of Transformer objects): from input arguments

real_vals (dict): The dictionary containing the original response column values

accumulate_preds(predicted_vals, ids, pred_stds=None)[source]

Add training, validation or test set predictions from the current fold to the data structure where we keep track of them.

Args:

predicted_vals (np.array): Array of the predicted values for the current dataset

ids (np.array): An np.array of compound ids for the current dataset

pred_stds (np.array): An array of the standard deviation in the predictions, not used in this method

Returns:
None
Raises:
ValueError: If predicted value dimensions don’t match num_tasks for RegressionPerfData
Side effects:

Overwrites the attribute pred_vals

Increments folds by 1

compute_perf_metrics(per_task=False)[source]

Computes the R-squared metrics for each task based on the accumulated values, averaged over training folds, along with standard deviations of the scores. If per_task is False, the scores are averaged over tasks and the overall standard deviation is reported instead.

Args:
per_task (bool): True if calculating per-task metrics, False otherwise.
Returns:

A tuple (r2_mean, r2_std):

r2_mean: A numpy array of mean R^2 scores for each task, averaged over folds, if per_task is True.
Otherwise, a float giving the R^2 score averaged over both folds and tasks.
r2_std: A numpy array of standard deviations over folds of R^2 values, if per_task is True.
Otherwise, a float giving the overall standard deviation.
get_pred_values()[source]

Returns the predicted values accumulated over training, with any transformations undone. If self.subset is ‘train’ or ‘test’, the function will return averages over the training folds for each compound along with standard deviations when there are predictions from multiple folds. Otherwise, returns a single predicted value for each compound.

Returns:

ids (np.array): list of compound IDs

vals (np.array): (ncmpds, ntasks) array of mean predicted values

fold_stds (np.array): (ncmpds, ntasks) array of standard deviations over folds if applicable, and None otherwise.

get_real_values(ids=None)[source]

Returns the real dataset response values, with any transformations undone, as an (ncmpds, ntasks) array in the same ID order as get_pred_values() (unless ids is specified).

Args:
ids (list of str): Optional list of compound IDs to return values for.
Returns:
np.array (ncmpds, ntasks) of the real dataset response values, with any transformations undone, in the same ID order as get_pred_values().
get_weights(ids=None)[source]

Returns the dataset response weights, as an (ncmpds, ntasks) array in the same ID order as get_pred_values() (unless ids is specified).

Args:
ids (list of str): Optional list of compound IDs to return values for.
Returns:
np.array (ncmpds, ntasks) of the real dataset response weights, in the same ID order as get_pred_values().
class pipeline.perf_data.PerfData(model_dataset, subset)[source]

Bases: object

Class with methods for accumulating prediction data over multiple cross-validation folds and computing performance metrics after all folds have been run. Abstract class with concrete subclasses for classification and regression models.

accumulate_preds(predicted_vals, ids, pred_stds=None)[source]
Raises:
NotImplementedError: The method is implemented by subclasses
compute_perf_metrics(per_task=False)[source]
Raises:
NotImplementedError: The method is implemented by subclasses
get_pred_values()[source]
Raises:
NotImplementedError: The method is implemented by subclasses
get_prediction_results()[source]
Raises:
NotImplementedError: The method is implemented by subclasses
get_real_values(ids=None)[source]
Raises:
NotImplementedError: The method is implemented by subclasses
get_weights(ids=None)[source]

Returns the dataset response weights as an (ncmpds, ntasks) array

Raises:
NotImplementedError: The method is implemented by subclasses
class pipeline.perf_data.RegressionPerfData(model_dataset, subset)[source]

Bases: pipeline.perf_data.PerfData

Class with methods for accumulating regression model prediction data over multiple cross-validation folds and computing performance metrics after all folds have been run. Abstract class with concrete subclasses for different split strategies.

Attributes:
set in __init__

num_tasks (int): Set to None, the number of tasks

num_cmpds (int): Set to None, the number of compounds

accumulate_preds(predicted_vals, ids, pred_stds=None)[source]
Raises:
NotImplementedError: The method is implemented by subclasses
compute_perf_metrics(per_task=False)[source]
Raises:
NotImplementedError: The method is implemented by subclasses
get_pred_values()[source]
Raises:
NotImplementedError: The method is implemented by subclasses
get_prediction_results()[source]

Returns a dictionary of performance metrics for a regression model. The dictionary values should contain only primitive Python types, so that it can be easily JSONified.

Args:
per_task (bool): True if calculating per-task metrics, False otherwise.
Returns:
pred_results (dict): dictionary of performance metrics for a regression model.
model_choice_score(score_type='r2')[source]

Computes a score function based on the accumulated predicted values, to be used for selecting the best training epoch and other hyperparameters.

Args:
score_type (str): The name of the scoring metric to be used, e.g. ‘r2’,
‘neg_mean_squared_error’, ‘neg_mean_absolute_error’, etc.; see https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter and sklearn.metrics.SCORERS.keys() for a complete list of options. Larger values of the score function indicate better models.
Returns:
score (float): A score function value. For multitask models, this will be averaged
over tasks.
class pipeline.perf_data.SimpleClassificationPerfData(model_dataset, transformers, subset, predict_probs=True, transformed=True)[source]

Bases: pipeline.perf_data.ClassificationPerfData

Class with methods for collecting classification model prediction and performance data from single-fold training and prediction runs.

Attributes:
Set in __init__:

subset (str): Label of the type of subset of dataset for tracking predictions

num_cmpds (int): The number of compounds in the dataset

num_tasks (int): The number of tasks in the dataset

pred_vals (dict): The dictionary of prediction results

folds (int): Initialized at zero, flag for determining which k-fold is being assessed

transformers (list of Transformer objects): from input arguments

real_vals (dict): The dictionary containing the original response column values

class_names (np.array): Assumes the classes are of deepchem index type (e.g. 0,1,2,…)

num_classes (int): The number of classes to predict on

accumulate_preds(predicted_vals, ids, pred_stds=None)[source]

Add training, validation or test set predictions from the current dataset to the data structure where we keep track of them.

Arguments:

predicted_vals (np.array): Array of predicted values (class probabilities)

ids (list): List of the compound ids of the dataset

pred_stds (np.array): Optional np.array of the prediction standard deviations

Side effects:
Updates self.pred_vals and self.perf_metrics
compute_perf_metrics(per_task=False)[source]

Returns the ROC_AUC metrics for each task based on the accumulated predictions. If per_task is False, returns the average ROC AUC over tasks.

Args:
per_task (bool): Whether to return individual ROC AUC scores for each task
Returns:
A tuple (roc_auc, std):
roc_auc: A numpy array of ROC AUC scores, if per_task is True. Otherwise,
a float giving the mean ROC AUC score over tasks.

std: Placeholder for an array of standard deviations. Always None for this class.

get_pred_values()[source]

Returns the predicted values accumulated over training, with any transformations undone. If self.subset is ‘train’, the function will average class probabilities over the k-1 folds in which each compound was part of the training set, and return the most probable class. Otherwise, there should be a single set of predicted probabilities for each validation or test set compound. Returns a tuple (ids, pred_classes, class_probs, prob_stds), where ids is the list of compound IDs, pred_classes is an (ncmpds, ntasks) array of predicted classes, class_probs is a (ncmpds, ntasks, nclasses) array of predicted probabilities for the classes, and prob_stds is a (ncmpds, ntasks, nclasses) array of standard errors for the class probability estimates.

Returns:
Tuple (ids, pred_classes, class_probs, prob_stds)

ids (list): Contains the dataset compound ids

pred_classes (np.array): Contains (ncmpds, ntasks) array of prediction classes

class_probs (np.array): Contains (ncmpds, ntasks, nclasses) array of predicted class probabilities

prob_stds (np.array): Contains (ncmpds, ntasks, nclasses) array of standard errors for the class probability estimates

get_real_values(ids=None)[source]

Returns the real dataset response values as an (ncmpds, ntasks, nclasses) array of indicator bits. If nclasses == 2, the returned array has dimension (ncmpds, ntasks).

Args:
ids: Ignored for this class
Returns:
np.array of the response values of the real dataset as indicator bits
get_weights(ids=None)[source]

Returns the dataset response weights

Args:
ids: Ignored for this class
Returns:
np.array: Containing the dataset response weights
class pipeline.perf_data.SimpleHybridPerfData(model_dataset, transformers, subset, is_ki, ki_convert_ratio=None, transformed=True)[source]

Bases: pipeline.perf_data.HybridPerfData

Class with methods for accumulating hybrid model prediction data from training, validation or test sets and computing performance metrics.

Attributes:
Set in __init__:

subset (str): Label of the type of subset of dataset for tracking predictions

num_cmpds (int): The number of compounds in the dataset

num_tasks (int): The number of tasks in the dataset

pred_vals (dict): The dictionary of prediction results

folds (int): Initialized at zero, flag for determining which k-fold is being assessed

transformers (list of Transformer objects): from input arguments

real_vals (dict): The dictionary containing the original response column values

accumulate_preds(predicted_vals, ids, pred_stds=None)[source]

Add training, validation or test set predictions to the data structure where we keep track of them.

Args:

predicted_vals (np.array): Array of predicted values

ids (list): List of the compound ids of the dataset

pred_stds (np.array): Optional np.array of the prediction standard deviations

Side effects:
Reshapes the predicted values and the standard deviations (if they are given)
compute_perf_metrics(per_task=False)[source]

Returns the R-squared metrics for each task or averaged over tasks based on the accumulated values

Args:
per_task (bool): True if calculating per-task metrics, False otherwise.
Returns:
A tuple (r2_score, std):

r2_score (np.array): An array of scores for each task, if per_task is True. Otherwise, it is a float containing the average R^2 score over tasks.

std: Always None for this class.

get_pred_values()[source]

Returns the predicted values accumulated over training, with any transformations undone. Returns a tuple (ids, values, stds), where ids is the list of compound IDs, values is a (ncmpds, ntasks) array of predictions, and stds is always None for this class.

Returns:
Tuple (ids, vals, stds)

ids (list): Contains the dataset compound ids

vals (np.array): Contains (ncmpds, ntasks) array of predictions

stds (np.array or None): Contains (ncmpds, ntasks) array of prediction standard deviations, if available; otherwise None

get_real_values(ids=None)[source]

Returns the real dataset response values, with any transformations undone, as an (ncmpds, ntasks) array with compounds in the same ID order as in the return from get_pred_values().

Args:
ids: Ignored for this class
Returns:
np.array: Containing the real dataset response values with transformations undone.
get_weights(ids=None)[source]

Returns the dataset response weights as an (ncmpds, ntasks) array

Args:
ids: Ignored for this class
Returns:
np.array: Containing the dataset response weights
class pipeline.perf_data.SimpleRegressionPerfData(model_dataset, transformers, subset, transformed=True)[source]

Bases: pipeline.perf_data.RegressionPerfData

Class with methods for accumulating regression model prediction data from training, validation or test sets and computing performance metrics.

Attributes:
Set in __init__:

subset (str): Label of the type of subset of dataset for tracking predictions

num_cmps (int): The number of compounds in the dataset

num_tasks (int): The number of tasks in the dataset

pred_vals (dict): The dictionary of prediction results

folds (int): Initialized to zero; counter for the k-fold split currently being assessed

transformers (list of Transformer objects): from input arguments

real_vals (dict): The dictionary containing the original response column values

accumulate_preds(predicted_vals, ids, pred_stds=None)[source]

Add training, validation or test set predictions to the data structure where we keep track of them.

Args:

predicted_vals (np.array): Array of predicted values

ids (list): List of the compound ids of the dataset

pred_stds (np.array): Optional np.array of the prediction standard deviations

Side effects:
Reshapes the predicted values and the standard deviations (if they are given)
compute_perf_metrics(per_task=False)[source]

Returns the R-squared metrics for each task or averaged over tasks based on the accumulated values

Args:
per_task (bool): True if calculating per-task metrics, False otherwise.
Returns:
A tuple (r2_score, std):

r2_score (np.array): An array of scores for each task, if per_task is True. Otherwise, it is a float containing the average R^2 score over tasks.

std: Always None for this class.

get_pred_values()[source]

Returns the predicted values accumulated over training, with any transformations undone. Returns a tuple (ids, vals, stds), where ids is the list of compound IDs, vals is a (ncmpds, ntasks) array of predictions, and stds is a (ncmpds, ntasks) array of prediction standard deviations, or None if no uncertainty estimates were accumulated.

Returns:
Tuple (ids, vals, stds)

ids (list): Contains the dataset compound ids

vals (np.array): Contains (ncmpds, ntasks) array of predictions

stds (np.array or None): Contains (ncmpds, ntasks) array of prediction standard deviations, if available; otherwise None

get_real_values(ids=None)[source]

Returns the real dataset response values, with any transformations undone, as an (ncmpds, ntasks) array with compounds in the same ID order as in the return from get_pred_values().

Args:
ids: Ignored for this class
Returns:
np.array: Containing the real dataset response values with transformations undone.
get_weights(ids=None)[source]

Returns the dataset response weights as an (ncmpds, ntasks) array

Args:
ids: Ignored for this class
Returns:
np.array: Containing the dataset response weights
pipeline.perf_data.create_perf_data(prediction_type, model_dataset, transformers, subset, **kwargs)[source]

Factory function that creates the right kind of PerfData object for the given subset, prediction_type (classification or regression) and split strategy (k-fold or train/valid/test).

Args:

prediction_type (str): classification or regression.

model_dataset (ModelDataset): Object representing the full dataset.

transformers (list): A list of transformer objects.

subset (str): Label in [‘train’, ‘valid’, ‘test’, ‘full’], indicating the type of subset of dataset for tracking predictions

**kwargs: Additional PerfData subclass arguments

Returns:
PerfData object
Raises:
ValueError: If split_strategy is not in [‘train_valid_test’, ‘k_fold_cv’].

ValueError: If prediction_type is not in [‘regression’, ‘classification’].
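
A hedged usage sketch; model_dataset and transformers are assumed to come from an existing pipeline run, and params.split_strategy determines which PerfData subclass is returned:

    from pipeline import perf_data

    # Track validation-set predictions for a regression model:
    pdata = perf_data.create_perf_data('regression', model_dataset,
                                       transformers, 'valid')
    # After generating predictions from a trained model:
    pdata.accumulate_preds(pred_vals, ids)
    r2, std = pdata.compute_perf_metrics(per_task=True)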
pipeline.perf_data.negative_predictive_value(y_real, y_pred)[source]

Computes negative predictive value of a binary classification model: NPV = TN/(TN+FN).

Args:

y_real (np.array): Array of ground truth values

y_pred (np.array): Array of predicted values

Returns:
(float): The negative predictive value
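
The computation is equivalent to the following sketch built on a confusion matrix (toy arrays for illustration):

    import numpy as np
    from sklearn.metrics import confusion_matrix

    y_real = np.array([0, 0, 1, 1, 0, 1])
    y_pred = np.array([0, 1, 1, 0, 0, 1])
    tn, fp, fn, tp = confusion_matrix(y_real, y_pred).ravel()
    npv = tn / (tn + fn)  # 2 / (2 + 1) = 0.667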
pipeline.perf_data.rms_error(y_real, y_pred)[source]

Calculates the root mean squared error. Score function used for model selection.

Args:

y_real (np.array): Array of ground truth values

y_pred (np.array): Array of predicted values

Returns:
(float): Root mean squared error of the input arrays
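
Equivalent to the following computation (toy arrays for illustration):

    import numpy as np

    y_real = np.array([1.0, 2.0, 3.0])
    y_pred = np.array([1.5, 2.0, 2.5])
    rmse = np.sqrt(np.mean((y_real - y_pred) ** 2))  # about 0.408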

pipeline.perf_plots module

Plotting routines for visualizing performance of regression and classification models

pipeline.perf_plots.plot_ROC_curve(MP, epoch_label='best', pdf_dir=None)[source]

Plot ROC curves for a classification model.

Args:

MP (ModelPipeline): Pipeline object for a model that was trained in the current Python session.

epoch_label (str): Label for training epoch to draw predicted values from. Currently ‘best’ is the only allowed value.

pdf_dir (str): If given, output the plots to a PDF file in the given directory.

Returns:
None
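
A hedged usage sketch, assuming MP is a ModelPipeline for a classification model trained earlier in the same session:

    from pipeline import perf_plots

    # Display ROC curves and also write them to a PDF in ./plots:
    perf_plots.plot_ROC_curve(MP, epoch_label='best', pdf_dir='./plots')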
pipeline.perf_plots.plot_perf_vs_epoch(MP, pdf_dir=None)[source]

Plot the current NN model’s standard performance metric (r2_score or roc_auc_score) vs epoch number for the training, validation and test subsets. If the model was trained with k-fold CV, plot shading for the validation set out to ±1 SD from the mean metric values, and plot the training and test set metrics from the final model retraining rather than from the cross-validation phase. Make a second plot showing the validation set model choice score, used for ranking training epochs and other hyperparameters, against epoch number.

Args:

MP (ModelPipeline): Pipeline object for a model that was trained in the current Python session.

pdf_dir (str): If given, output the plots to a PDF file in the given directory.

Returns:
None
pipeline.perf_plots.plot_prec_recall_curve(MP, epoch_label='best', pdf_dir=None)[source]

Plot precision-recall curves for a classification model.

Args:

MP (ModelPipeline): Pipeline object for a model that was trained in the current Python session.

epoch_label (str): Label for training epoch to draw predicted values from. Currently ‘best’ is the only allowed value.

pdf_dir (str): If given, output the plots to a PDF file in the given directory.

Returns:
None
pipeline.perf_plots.plot_pred_vs_actual(MP, epoch_label='best', threshold=None, error_bars=False, pdf_dir=None)[source]

Plot predicted vs actual values from a trained regression model for each split subset (train, valid, and test).

Args:

MP (ModelPipeline): Pipeline object for a model that was trained in the current Python session.

epoch_label (str): Label for training epoch to draw predicted values from. Currently ‘best’ is the only allowed value.

threshold (float): Threshold activity value to mark on plot with dashed lines.

error_bars (bool): If true and if uncertainty estimates are included in the model predictions, draw error bars
at ±1 SD from the predicted y values.

pdf_dir (str): If given, output the plots to a PDF file in the given directory.

Returns:
None
pipeline.perf_plots.plot_umap_feature_projections(MP, ndim=2, num_neighbors=20, min_dist=0.1, fit_to_train=True, dist_metric='euclidean', dist_metric_kwds={}, target_weight=0, random_seed=17, pdf_dir=None)[source]

Projects features of a model’s input dataset using UMAP to 2D or 3D coordinates and draws a scatterplot. Shape-codes plot markers to indicate whether the associated compound was in the training, validation or test set. For classification models, also uses the marker shape to indicate whether the compound’s class was correctly predicted, and uses color to indicate whether the true class was active or inactive. For regression models, uses the marker color to indicate the discrepancy between the predicted and actual values.

Args:

MP (ModelPipeline): Pipeline object for a model that was trained in the current Python session.

ndim (int): Number of dimensions (2 or 3) to project features into.

num_neighbors (int): Number of nearest neighbors used by UMAP for manifold approximation.
Larger values give a more global view of the data, while smaller values preserve more local detail.

min_dist (float): Parameter used by UMAP to set minimum distance between projected points.

fit_to_train (bool): If true (the default), fit the UMAP projection to the training set feature vectors only.
Otherwise, fit it to the entire dataset.
dist_metric (str): Name of metric to use for initial distance matrix computation. Check UMAP documentation
for supported values. The metric should be appropriate for the type of features used in the model (fingerprints or descriptors); note that jaccard is equivalent to Tanimoto distance for ECFP fingerprints.
dist_metric_kwds (dict): Additional key-value pairs used to parameterize dist_metric; see the UMAP documentation.
In particular, dist_metric_kwds[‘p’] specifies the power/exponent for the Minkowski metric.
target_weight (float): Weighting factor determining balance between activities and feature values in determining topology
of projected points. A weight of zero prioritizes the feature vectors; weight = 1 prioritizes the activity values, so that compounds with the same activity tend to be clustered together.

random_seed (int): Seed for random number generator.

pdf_dir (str): If given, output the plot to a PDF file in the given directory.

Returns:
None
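
For example, a sketch of a 2D projection using the Jaccard (Tanimoto) metric, which is appropriate for binary ECFP fingerprint features; MP is assumed to be a trained ModelPipeline:

    from pipeline import perf_plots

    perf_plots.plot_umap_feature_projections(MP, ndim=2, num_neighbors=20,
                                             dist_metric='jaccard',
                                             target_weight=0, pdf_dir=None)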
pipeline.perf_plots.plot_umap_train_set_neighbors(MP, num_neighbors=20, min_dist=0.1, dist_metric='euclidean', dist_metric_kwds={}, random_seed=17, pdf_dir=None)[source]

Projects the features of the whole dataset to 2 dimensions, without regard to response values. Plots training and validation set or training and test set compounds, color- and symbol-coded according to their actual classification and split subset. The plot does not take predicted values into account. Does not work with regression data.

Args:

MP (ModelPipeline): Pipeline object for a model that was trained in the current Python session.

num_neighbors (int): Number of nearest neighbors used by UMAP for manifold approximation.
Larger values give a more global view of the data, while smaller values preserve more local detail.

min_dist (float): Parameter used by UMAP to set minimum distance between projected points.

dist_metric (str): Name of metric to use for initial distance matrix computation. Check UMAP documentation
for supported values. The metric should be appropriate for the type of features used in the model (fingerprints or descriptors); note that jaccard is equivalent to Tanimoto distance for ECFP fingerprints.
dist_metric_kwds (dict): Additional key-value pairs used to parameterize dist_metric; see the UMAP documentation.
In particular, dist_metric_kwds[‘p’] specifies the power/exponent for the Minkowski metric.

random_seed (int): Seed for random number generator.

pdf_dir (str): If given, output the plot to a PDF file in the given directory.

pipeline.splitting module

Encapsulates everything that depends on how datasets are split: the splitting itself, training, validation, testing, generation of predicted values and performance metrics.

class pipeline.splitting.KFoldSplitting(params)[source]

Bases: pipeline.splitting.Splitting

Subclass to deal with everything related to k-fold cross-validation splits

Attributes:
Set in __init__:

params (Namespace object): contains all parameter information

split (str): Type of splitter in [‘index’,’random’,’scaffold’,’butina’,’ave_min’,’stratified’]

splitter (Deepchem split object): A splitting object of the subtype specified by split

num_folds (int): The number of k-fold splits to perform

get_split_prefix(parent='')[source]

Returns a string identifying the split strategy (TVT or k-fold) and the splitting method (index, scaffold, etc.) for use in filenames, dataset keys, etc.

Args:
parent (str): Default to empty string. Sets the parent directory for the output string
Returns:
(str): A string that identifies the split strategy and the splitting method. Appends a parent directory in front of the fold description
split_dataset(dataset, attr_df, smiles_col)[source]

Splits dataset into training, testing and validation sets.

Args:

dataset (deepchem Dataset): full featurized dataset

attr_df (Pandas DataFrame): dataframe containing SMILES strings indexed by compound IDs.

smiles_col (string): name of SMILES column (hack for now until deepchem fixes scaffold and butina splitters)

Returns:

[(train, valid)], test, [(train_attr, valid_attr)], test_attr:

train (deepchem Dataset): training dataset.

valid (deepchem Dataset): validation dataset.

test (deepchem Dataset): testing dataset.

train_attr (Pandas DataFrame): dataframe of SMILES strings indexed by compound IDs for training set.

valid_attr (Pandas DataFrame): dataframe of SMILES strings indexed by compound IDs for validation set.

test_attr (Pandas DataFrame): dataframe of SMILES strings indexed by compound IDs for test set.

Raises:
Exception: If there are duplicate IDs or SMILES strings in the dataset or in attr_df
class pipeline.splitting.Splitting(params)[source]

Bases: object

Base class for train/validation/test and k-fold dataset splitting. Wrapper for DeepChem Splitter classes that handle the specific splitting methods (e.g. random, scaffold, etc.).

Attributes:
Set in __init__:

params (Namespace object): contains all parameter information

split (str): Type of splitter in [‘index’,’random’,’scaffold’,’butina’,’ave_min’,’stratified’]

splitter (Deepchem split object): A splitting object of the subtype specified by split

get_split_prefix(parent='')[source]

Must be implemented by subclasses

Raises:
NotImplementedError: The method is implemented by subclasses
needs_smiles()[source]

Returns True if the underlying DeepChem splitter requires compound IDs to be SMILES strings

Returns:
(bool): True if Deepchem splitter requires SMILES strings as compound IDs, currently only true if using scaffold or butina splits
split_dataset(dataset, attr_df, smiles_col)[source]

Must be implemented by subclasses

Raises:
NotImplementedError: The method is implemented by subclasses
class pipeline.splitting.TrainValidTestSplitting(params)[source]

Bases: pipeline.splitting.Splitting

Subclass to deal with everything related to standard train/validation/test splits

Attributes:
Set in __init__:

params (Namespace object): contains all parameter information

split (str): Type of splitter in [‘index’,’random’,’scaffold’,’butina’,’ave_min’,’temporal’,’stratified’]

splitter (Deepchem split object): A splitting object of the subtype specified by split

num_folds (int): The number of k-fold splits to perform

get_split_prefix(parent='')[source]

Returns a string identifying the split strategy (TVT or k-fold) and the splitting method (index, scaffold, etc.) for use in filenames, dataset keys, etc.

Args:
parent (str): Default to empty string. Sets the parent directory for the output string
Returns:
(str): A string that identifies the split strategy and the splitting method. Appends a parent directory in front of the fold description
split_dataset(dataset, attr_df, smiles_col)[source]

Splits dataset into training, testing and validation sets.

For ave_min, random, scaffold, and index splits, self.params.split_valid_frac and self.params.split_test_frac must be defined, and train_frac = 1.0 - self.params.split_valid_frac - self.params.split_test_frac.

For the butina split, the test set size is not user-defined; it depends on how many clusters qualify for placement in the test set, so train_frac = 1.0 - self.params.split_valid_frac.

For the temporal split, the test set size is likewise not user-defined; it depends on the number of compounds with dates after the cutoff date, so train_frac = 1.0 - self.params.split_valid_frac. (See the sketch following this entry.)
Args:

dataset (deepchem Dataset): full featurized dataset

attr_df (Pandas DataFrame): dataframe containing SMILES strings indexed by compound IDs.

smiles_col (string): name of SMILES column (hack for now until deepchem fixes scaffold and butina splitters)

Returns:

[(train, valid)], test, [(train_attr, valid_attr)], test_attr:

train (deepchem Dataset): training dataset.

valid (deepchem Dataset): validation dataset.

test (deepchem Dataset): testing dataset.

train_attr (Pandas DataFrame): dataframe of SMILES strings indexed by compound IDs for training set.

valid_attr (Pandas DataFrame): dataframe of SMILES strings indexed by compound IDs for validation set.

test_attr (Pandas DataFrame): dataframe of SMILES strings indexed by compound IDs for test set.

Raises:
Exception: If there are duplicate IDs or SMILES strings in the dataset or in attr_df
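
As described above, the training fraction is derived from the user-specified validation and test fractions; a minimal sketch of that arithmetic, with params standing in for the pipeline parameter namespace:

    # ave_min, random, scaffold and index splits: both fractions are user-defined.
    train_frac = 1.0 - params.split_valid_frac - params.split_test_frac

    # butina and temporal splits: the test set size is data-driven.
    train_frac = 1.0 - params.split_valid_frac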
pipeline.splitting.check_if_dupe_smiles_dataset(dataset, attr_df, smiles_col)[source]

Returns a boolean: True if there are duplicates within the DeepChem dataset’s ids, or duplicates in the smiles_col column of attr_df

Args:

dataset (deepchem Dataset): full featurized dataset

attr_df (Pandas DataFrame): dataframe containing SMILES strings indexed by compound IDs.

smiles_col (string): name of SMILES column (hack for now until deepchem fixes scaffold and butina splitters)

Returns:
(bool): True if there are duplicates in the ids of the dataset or in the smiles_col of attr_df; False otherwise
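
The check can be thought of as the following sketch (an illustration, not the exact implementation):

    def has_dupes(dataset, attr_df, smiles_col):
        # Duplicate compound IDs in the DeepChem dataset?
        dupe_ids = len(set(dataset.ids)) != len(dataset.ids)
        # Duplicate SMILES strings in the attribute table?
        dupe_smiles = attr_df[smiles_col].duplicated().any()
        return dupe_ids or dupe_smiles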
pipeline.splitting.create_splitting(params)[source]

Factory function to create appropriate type of Splitting object, based on dataset parameters

Args:
params (Namespace object): contains all parameter information.
Returns:
(Splitting object): Splitting subtype (TrainValidTestSplitting or KFoldSplitting) determined by params.split_strategy
Raises:
Exception: If params.split_strategy not in [‘train_valid_test’,’k_fold_cv’]. Unsupported split strategy
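
A hedged usage sketch for a train/valid/test run; params is assumed to be a parsed pipeline parameter namespace with split_strategy set to ‘train_valid_test’:

    from pipeline import splitting

    splitter = splitting.create_splitting(params)
    [(train, valid)], test, [(train_attr, valid_attr)], test_attr = \
        splitter.split_dataset(dataset, attr_df, params.smiles_col)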
pipeline.splitting.select_attrs_by_dset_ids(dataset, attr_df)[source]

Returns a subset of the data frame attr_df selected by matching compound IDs in the index of attr_df against the ids in the dc.data.Dataset object dataset.

Args:

dataset (DiskDataset): The DeepChem dataset; its ids should match the compound IDs in the index of attr_df

attr_df (DataFrame): Data frame indexed by compound IDs, which should match the dataset ids

Returns:
subattr_df (DataFrame): A subset of attr_df as determined by the ids in dataset
pipeline.splitting.select_attrs_by_dset_smiles(dataset, attr_df, smiles_col)[source]

Returns a subset of the data frame attr_df selected by matching SMILES strings in attr_df against the ids in the dc.data.Dataset object dataset.

Args:

dataset (DiskDataset): The DeepChem dataset; its ids should match entries in attr_df

attr_df (DataFrame): Contains the compound IDs, which should match the dataset ids, and a column of SMILES strings named by smiles_col

smiles_col (str): Name of the column containing SMILES strings

Returns:
subattr_df (DataFrame): A subset of attr_df as determined by the ids in dataset. Selected by matching SMILES strings in attr_df to the ids in the dataset
pipeline.splitting.select_dset_by_attr_ids(dataset, attr_df)[source]

Returns a subset of the given dc.data.Dataset object selected by matching compound IDs in the index of attr_df against the ids in the dataset.

Args:

dataset (DiskDataset): The DeepChem dataset; its ids should match those in attr_df

attr_df (DataFrame): Contains the compound IDs used to subset the dataset; IDs should match the dataset ids

Returns:
subset (DiskDataset): A subset of the deepchem dataset as determined by the ids in attr_df
pipeline.splitting.select_dset_by_id_list(dataset, id_list)[source]

Returns a subset of the given dc.data.Dataset object selected by matching compound IDs in the given list against the ids in the dataset.

Args:

dataset (DiskDataset): The DeepChem dataset; its ids should match those in id_list

id_list (list): List of compound IDs used to subset the dataset; IDs should match the dataset ids

Returns:
subset (DiskDataset): A subset of the deepchem dataset as determined by the ids in id_list
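
Conceptually, this is equivalent to selecting row indices with DeepChem’s select method (a sketch, not the exact implementation):

    import numpy as np

    # Indices of dataset rows whose compound ID appears in id_list:
    keep = np.where(np.isin(dataset.ids, id_list))[0]
    subset = dataset.select(keep)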

pipeline.transformations module

Classes providing different methods of transforming response data and/or features in datasets, beyond those provided by DeepChem.

class pipeline.transformations.NormalizationTransformerHybrid(transform_X=False, transform_y=False, transform_w=False, dataset=None, move_mean=True)[source]

Bases: deepchem.trans.transformers.NormalizationTransformer

Test extension to check for missing data

transform(dataset, parallel=False)[source]

Transforms all internally stored data in dataset.

This method transforms all internal data in the provided dataset by using the Dataset.transform method. Note that this method adds X-transform, y-transform columns to metadata. Specified keyword arguments are passed on to Dataset.transform.

Parameters:

dataset: dc.data.Dataset
Dataset object to be transformed.
parallel: bool, optional (default False)
If True, use multiple processes to transform the dataset in parallel. For large datasets, this might be faster.
out_dir: str, optional
If out_dir is specified in kwargs and dataset is a DiskDataset, the output dataset will be written to the specified directory.

Returns:

Dataset
A newly transformed Dataset object
transform_array(X, y, w, ids)[source]

Transform the data in a set of (X, y, w) arrays.

untransform(z, isreal=True)[source]

Undo transformation on provided data.

Parameters:

z: np.ndarray
Array to transform back.

Returns:

z_out: np.ndarray
Array with normalization undone.
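
For a NormalizationTransformer fit with move_mean=True, undoing the y transformation amounts to rescaling by the stored standard deviations and adding back the means; a sketch under that assumption:

    # y_means and y_stds are the per-task statistics stored when the
    # transformer was fit; z holds transformed response values.
    z_out = z * y_stds + y_means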
class pipeline.transformations.NormalizationTransformerMissingData(transform_X=False, transform_y=False, transform_w=False, dataset=None, transform_gradients=False, move_mean=True)[source]

Bases: deepchem.trans.transformers.NormalizationTransformer

Test extension to check for missing data

transform(dataset, parallel=False)[source]

Transforms all internally stored data in dataset.

This method transforms all internal data in the provided dataset by using the Dataset.transform method. Note that this method adds X-transform, y-transform columns to metadata. Specified keyword arguments are passed on to Dataset.transform.

Parameters:

dataset: dc.data.Dataset
Dataset object to be transformed.
parallel: bool, optional (default False)
If True, use multiple processes to transform the dataset in parallel. For large datasets, this might be faster.
out_dir: str, optional
If out_dir is specified in kwargs and dataset is a DiskDataset, the output dataset will be written to the specified directory.

Returns:

Dataset
A newly transformed Dataset object
transform_array(X, y, w, ids)[source]

Transform the data in a set of (X, y, w) arrays.

class pipeline.transformations.UMAPTransformer(params, dataset)[source]

Bases: transformers.Transformer

Dimension reduction transformations using the UMAP algorithm.

Attributes:
mapper (UMAP): UMAP transformer

scaler (RobustScaler): Centering/scaling transformer
transform(dataset, parallel=False)[source]

Transforms all internally stored data in dataset.

This method transforms all internal data in the provided dataset by using the Dataset.transform method. Note that this method adds X-transform, y-transform columns to metadata. Specified keyword arguments are passed on to Dataset.transform.

Parameters:

dataset: dc.data.Dataset
Dataset object to be transformed.
parallel: bool, optional (default False)
If True, use multiple processes to transform the dataset in parallel. For large datasets, this might be faster.
out_dir: str, optional
If out_dir is specified in kwargs and dataset is a DiskDataset, the output dataset will be written to the specified directory.

Returns:

Dataset
A newly transformed Dataset object
transform_array(X, y, w, ids)[source]

Transform the data in a set of (X, y, w, ids) arrays.

Parameters:

X: np.ndarray
Array of features.
y: np.ndarray
Array of labels.
w: np.ndarray
Array of weights.
ids: np.ndarray
Array of identifiers.

Returns:

Xtrans: np.ndarray
Transformed array of features.
ytrans: np.ndarray
Transformed array of labels.
wtrans: np.ndarray
Transformed array of weights.
idstrans: np.ndarray
Transformed array of ids.
untransform(z)[source]

Reverses stored transformation on provided data.

pipeline.transformations.create_feature_transformers(params, model_dataset)[source]

Fit a scaling and centering transformation to the feature matrix of the given dataset, and return a DeepChem transformer object holding its parameters.

Args:

params (argparse.Namespace): Object containing the parameter list

model_dataset (ModelDataset): Contains the dataset to be transformed.

Returns:
(list of DeepChem transformer objects): list of transformers for the feature matrix
pipeline.transformations.create_weight_transformers(params, model_dataset)[source]

Fit an optional balancing transformation to the weight matrix of the given dataset, and return a DeepChem transformer object holding its parameters.

Args:

params (argparse.Namespace): Object containing the parameter list

model_dataset (ModelDataset): Contains the dataset to be transformed.

Returns:
(list of DeepChem transformer objects): list of transformers for the weight matrix
pipeline.transformations.get_statistics_missing_ydata(dataset)[source]

Compute and return statistics of this dataset.

This updated version gives the option to check for and ignore missing values in the y variable only. The X matrix is still assumed to have no missing values.
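
A sketch of those per-task statistics with missing y values masked out, assuming (following DeepChem conventions) that missing entries carry zero weight in dataset.w:

    import numpy as np

    # Mean and std of each response column, ignoring entries with w == 0:
    y = np.ma.masked_array(dataset.y, mask=(dataset.w == 0))
    y_means = y.mean(axis=0).filled(0.0)
    y_stds = y.std(axis=0).filled(1.0)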

pipeline.transformations.get_transformer_specific_metadata(params)[source]

Returns a dictionary of parameters related to the currently selected transformer(s).

Args:
params (argparse.Namespace): Object containing the parameter list
Returns:
meta_dict (dict): Nested dictionary of parameters and values for each currently active transformer.
pipeline.transformations.transformers_needed(params)[source]

Returns a boolean indicating whether response and/or feature transformers would be created for a model with the given parameters.

Args:
params (argparse.Namespace): Object containing the parameter list
Returns:
boolean: True if transformers are required given the model parameters.

Module contents