utils package¶

Submodules¶

utils.compare_splits_plots module¶

class utils.compare_splits_plots.SplitStats(total_df, split_df, smiles_col, id_col, response_cols)[source]¶

Bases: object

This object manages a dataset and a given split dataframe.

dist_hist_plot(dists, title, dist_path='')[source]¶

Creates a histogram of pairwise Tanimoto distances between training and test sets

Args:

dist_path (str): Optional Where to save the plot. The string ‘_dist_hist’ will be: appended to this input

dist_hist_train_v_test_plot(ax=None)[source]¶

Plots histogram of nearest neighbor Tanimoto distances between test and training subset compounds.

Args:: ax (matploblib Axes): Axes object to draw plot in. If None, one will be created.
Returns:: ax (matploblib Axes): Axes object for plot

dist_hist_train_v_valid_plot(ax=None)[source]¶

Plots histogram of nearest neighbor Tanimoto distances between valid and training subset compounds.

Args:: ax (matploblib Axes): Axes object to draw plot in. If None, one will be created.
Returns:: ax (matploblib Axes): Axes object for plot

make_all_plots(dist_path='')[source]¶

Makes a series of diagnostic plots

Args:

dist_path (str): Optional Where to save the plot. The string ‘_frac_box’ will be: appended to this input

print_stats()[source]¶: Prints useful statistics to stdout

subset_frac_plot(dist_path='')[source]¶

Makes a box plot of the subset fractions

Args:

dist_path (str): Optional Where to save the plot. The string ‘_frac_box’ will be: appended to this input

umap_plot(dist_path='')[source]¶

Plots the first 10000 samples in Umap space using Morgan Fingerprints

Args:

dist_path (str): Optional Where to save the plot. The string ‘_umap_scatter’ will be: appended to this input

utils.compare_splits_plots.parse_args()[source]¶

utils.compare_splits_plots.save_figure(filename)[source]¶

Saves a figure to disk. Saves both png and svg formats.

Args:: filename (str): The name of the figure.

utils.compare_splits_plots.split(total_df, split_df, id_col)[source]¶

Splits a dataset into training, test and validation sets using a given split.

Args:

total_df (DataFrame): A pandas dataframe. split_df (DataFrame): A split dataframe containing ‘cmpd_id’ and ‘subset’ columns. id_col (str): The ID column in total_df

Returns:

(DataFrame, DataFrame, DataFrame): Three dataframes for train, test, and valid: respectively.

utils.curate_data module¶

utils.curate_data.add_classification_column(thresholds, value_column, label_column, data, right_inclusive=True)[source]¶

Add a classification column to a DataFrame.

Add a classification column ‘label_column’ to DataFrame ‘data’ based on values in ‘value_column’, according to a sequence of thresholds. The number of classes is one plus the number of thresholds.

Args:

thresholds (float or sequence of floats): Thresholds to use to assign class labels. Label i will: be assigned to values such that thresholds[i-1] < value <= thresholds[i] (if right_inclusive is True) or thresholds[i-1] <= value < thresholds[i] (otherwise).

value_column (str): Name of the column from which class labels are derived.

label_column (str): Name of the new column to be created for class labels.

data (DataFrame): DataFrame holding all data.

right_inclusive (bool): Whether the thresholding intervals are closed on the right or on the left.: Set this False to get the same behavior as add_binary_tertiary_classification. The default behavior is preferred for the common case where the classification is based on a left-censoring threshold.

Returns:

DataFrame: DataFrame updated to include class label column.

utils.curate_data.aggregate_assay_data(assay_df, value_col='VALUE_NUM', output_value_col=None, label_actives=True, active_thresh=None, id_col='CMPD_NUMBER', smiles_col='rdkit_smiles', relation_col='VALUE_FLAG', date_col=None, verbose=False)[source]¶

Aggregates replicated values in assay data

Map RDKit SMILES strings in assay_df to base structures, then compute an MLE estimate of the mean value over replicate measurements for the same SMILES strings, taking censoring into account. Generate an aggregated result table with one value for each unique base SMILES string, to be used in an ML-ready dataset.

Args:

assay_df (DataFrame): The input DataFrame to be processed.

value_col (str): The column in the DataFrame containing assay values to be averaged.

output_value_col (str): Optional; the column name to use in the output DataFrame for the averaged data.

label_actives (bool): If True, generate an additional column ‘active’ indicating whether the mean value is above a threshold specified by active_thresh.

active_thresh (float): The threshold to be used for labeling compounds as active or inactive.: If active_thresh is None (the default), the threshold used is the minimum reported value across all records with left-censored values (i.e., those with ‘<’ in the relation column.

id_col (str): The input DataFrame column containing compound IDs.

smiles_col (str): The input DataFrame column containing SMILES strings.

relation_col (str): The input DataFrame column containing relational operators (<, >, etc.).

date_col (str): The input DataFrame column containing dates when the assay data was uploaded. If not None, the code will assign the earliest: date among replicates to the aggregate data record.

Returns:

A DataFrame containing averaged assay values, with one value per compound.

utils.curate_data.average_and_remove_duplicates(column, tolerance, list_bad_duplicates, data, max_stdev=100000, compound_id='CMPD_NUMBER', rm_duplicate_only=False, smiles_col='rdkit_smiles_parent')[source]¶

This while loop loops through until no ‘bad duplicates’ are left.

This function removes duplicates based on max_stdev and tolerance. If the value in data[column] falls too far from the mean based on tolerance and max_stdev then that entry is removed. This is repeated until all bad entries are removed

Args:

column (str): column with the value of interest

tolerance (float): acceptable % difference between value and average: ie.: if “[(value - mean)/mean*100]>tolerance” then remove data row

list_bad_duplicates (str): ‘Yes’ to list the bad duplicates

data (DataFrame): input DataFrame

max_stdev (float): maximum standard deviation threshold

compound_id (str): column containing compound ids

rm_duplicate_only (bool): only remove bad duplicates, don’t average good ones, the resulting table can be fed into aggregate assay data to further process.: note: The mean is recalculated on each loop through to make sure it isn’t skewed by the ‘bad duplicate’ values

smiles_col (str): column containing base rdkit smiles strings

Returns:

DataFrame: Returns remaining rows after all bad duplicates have been removed.

utils.curate_data.create_new_rows_for_extra_results(extra_result_col, value_col, data)[source]¶

Moves results from an extra column to an existing column

Returns a new DataFrame with values from ‘extra_result_col’ appended to the end of ‘value_col’. NaN values in ‘extra_result_col’ are dropped. ‘Extra_result_col’ is dropped from the resulting DataFrame

Args:

extra_result_col (str): A column in ‘data’.

value_col (str): A column in ‘data’.

data (DataFrame):

Returns:

DataFrame

utils.curate_data.filter_in_by_column_values(column, values, data)[source]¶

Include rows only for given values in specified column.

Filters in all rows in data if row[column] in values.

Args:

column (str): Name of a column in data.

values (iterable): An iterable, Series, DataFrame, or dict of values: contained in data[column].

data (DataFrame): A DataFrame.

Returns:

DataFrame: DataFrame containing filtered rows.

utils.curate_data.filter_in_out_by_column_values(column, values, data, in_out)[source]¶

Include rows only for given values in specified column.

Given a DataFrame, column, and an iterable, Series, DataFrame, or dict, of values, return a DataFrame with rows containing value in values or all rows that do not containe a value in values.

Args:

column (str): Name of a column in data.

values (iterable): An iterable, Series, DataFrame, or dict of values: contained in data[column].

data (DataFrame): A DataFrame.

in_out (str): If set to ‘in’, will filter in rows that contain a value: in values. If set to anything else, this function will filter out rows that contian a value in values.

Returns:

DataFrame: DataFrame containing filtered rows.

utils.curate_data.filter_out_by_column_values(column, values, data)[source]¶

Exclude rows only for given values in specified column.

Filters out all rows in data if row[column] in values.

Args:

column (str): Name of a column in data.

values (iterable): An iterable, Series, DataFrame, or dict of values: contained in data[column].

data (DataFrame): A DataFrame.

Returns:

DataFrame: DataFrame containing filtered rows.

utils.curate_data.filter_out_comments(values, values_cs, data)[source]¶

Remove rows that contain the text listed

Removes any rows where data[‘COMMENTS’] contains the words in values or values_cs. Used for removing results that indicate bad data in the comments.

Args:

values (str): list of values that are not case sensitive

values_cs (str): list of values that are case sensitive

data (DataFrame): DataFrame containing a column named ‘COMMENTS’

Returns:

DataFrame: Returns a DataFrame with the remaining rows

utils.curate_data.freq_table(dset_df, column, min_freq=1)[source]¶

Generate a DataFrame tabluating the repeat requencies of unique values.

Generate a DataFrame tabulating the repeat frequencies of each unique value in ‘column’. Restrict it to values occurring at least min_freq times.

Args:

dset_df (DataFrame): An input DataFrame

column (str): The name of one column in DataFrame

min_freq (int): Restrict unique count to at least min_freq times.

Returns:

DataFrame: Dataframe containing two columns: the column passed in as the ‘column’ argument: and the column ‘Count’. The ‘Count’ column contains the number of occurances for each value in the ‘column’ argument.

utils.curate_data.get_rdkit_smiles_parent(data)[source]¶

Strip the salts off the rdkit SMILES strings

First, loops through data and determines the base/parent smiles string for each row. Appends the base smiles string to a new row in a list. Then adds the list as a new column, ‘rdkit_smiles_parent’, in ‘data’. Basically calls base_smiles_from_smiles for each smile in the column ‘rdkit_smiles’

Args:: data (DataFrame): A DataFrame with a column named ‘rdkit_smiles’.
Returns:: DataFrame with column ‘rdkit_smiles_parent’ with salts stripped

utils.curate_data.labeled_freq_table(dset_df, columns, min_freq=1)[source]¶

Generate a frequency table in which additional columns are included.

Generate a frequency table in which additional columns are included. The first column in ‘columns’ is assumed to be a unique ID; there should be a many-to-1 mapping from the ID to each of the additional columns.

Args:

dset_df (DataFrame): The input DataFrame.

columns (list(str)): A list of columns to include in the output frequency table.: The first column in ‘columns’ is assumed to be a unique ID; there should be a many-to-1 mapping from the ID to each of the additional columns.

min_freq (int): Restrict unique count to at least min_freq times.

Returns:

DataFrame: A DataFrame containing a frequency table.

Raises:

Exception: If the DataFrame violates the rule: there should be a many-to-1: mapping from the ID to each of the additional columns.

utils.curate_data.mle_censored_mean(cmpd_df, std_est, value_col='PIC50', relation_col='relation')[source]¶

Computes maximum likelihood estimate of the true mean value for a single replicated compound.

Compute a maximum likelihood estimate of the true mean value underlying the distribution of replicate assay measurements for a single compound. The data may be a mix of censored and uncensored measurements, as indicated by the ‘relation’ column in the input DataFrame cmpd_df. std_est is an estimate for the standard deviation of the distribution, which is assumed to be Gaussian; we typically compute a common estimate for the whole dataset using replicate_rmsd().

Args:

cmpd_df (DataFrame): DataFrame containing measurements and SMILES strings.

std_est (float): An estimate for the standard deviation of the distribution.

smiles_col (str): Name of the column that contains SMILES strings.

value_col (str): Name of the column that contains target values.

relation_col (str): The input DataFrame column containing relational operators (<, >, etc.).

Returns:

float: maximum likelihood estimate of the true mean for a replicated compound str: Relation, ‘’ not censored, ‘>’ right censored, ‘<’ left censored

utils.curate_data.remove_outlier_replicates(df, response_col='pIC50', id_col='compound_id', max_diff_from_median=1.0)[source]¶

Examine groups of replicate measurements for compounds identified by compound ID and compute median response for each group. Eliminate measurements that differ by more than a given value from the median; note that in some groups this will result in all replicates being deleted. This function should be used together with aggregate_assay_data instead of average_and_remove_duplicates to reduce data to a single value per compound.

Args:

df (DataFrame): Table of compounds and response data

response_col (str): Column containing response values

id_col (str): Column that uniquely identifies compounds, and therefore measurements to be treated as replicates.

max_diff_from_median (float): Maximum absolute difference from median value allowed for retained replicates.

Returns:

result_df (DataFrame): Filtered data frame with outlier replicates removed. Rows with NaN values in the: response column are removed as a preprocessing step before outlier detection.

utils.curate_data.replicate_rmsd(dset_df, smiles_col='base_rdkit_smiles', value_col='PIC50', relation_col='relation', default_val=1.0)[source]¶

Compute RMS deviation of all replicate uncensored measurements from means

Compute RMS deviation of all replicate uncensored measurements in dset_df from their means. Measurements are treated as replicates if they correspond to the same SMILES string, and are considered censored if the relation column contains > or <. The resulting value is meant to be used as an estimate of measurement error for all compounds in the dataset.

Args:

dset_df (DataFrame): DataFrame containing uncensored measurements and SMILES strings.

smiles_col (str): Name of the column that contains SMILES strings.

value_col (str): Name of the column that contains target values.

relation_col (str): The input DataFrame column containing relational operators (<, >, etc.).

default_val (float): The value to return if there are no compounds with replicate measurements.

Returns:

float: returns root mean squared deviation of all replicate uncensored measurements

utils.curate_data.set_group_permissions(path, system='AD', owner='GSK')[source]¶

Sets file and group permissions to standard values for a dataset containing proprietary data owned by ‘owner’. Later we may add a ‘public’ option, or groups for data from other pharma companies.

Args:

path (string): File path

system (string): Computing environment from which group ownerships will be derived; currently, either ‘LC’ for LC filesystems or ‘AD’ for LLNL systems where owners and groups are managed by Active Directory.

owner (string): Who the data belongs to, either ‘public’ or the name of a company (e.g. ‘GSK’) associated with a restricted access group.

Returns:

None

utils.curate_data.summarize_data(column, num_bins, title, units, filepath, data, log_column='No')[source]¶

Summarizes the in data[column]

Summarizes the data by printing mean, stdev, max, and min of the data. Creates plots of the binned values in data[column]. If log_column != ‘No’ this also creates plots that compares normal and log distributions of the data.

Args:

column (str): Column of interest.

num_bins (int): Number of bins in the histogram.

title (str): Title of the histogram.

units (str): Units for values in ‘column’.

filepath (str): This file path gets printed to the console.

data (DataFrame): Input DataFrame.

log_column (str): Defaults to ‘No’. Any other value will generate: a plot comparing normal and log distributions.

Returns:

None

utils.curate_data.xc50topxc50_for_nm(x)[source]¶

Convert XC50 values measured in nanomolars to -log10 (PX50)

Args :: x (float): input XC50 value measured in nanomolars
Returns :: float: -log10 value of x

utils.data_curation_functions module¶

data_curation_functions.py

Extract Kevin’s functions for curation of public datasets Modify them to match Jonathan’s curation methods in notebook 01/30/2020

utils.data_curation_functions.atom_curation(targ_lst, smiles_lst, shared_inchi_keys)[source]¶

Apply ATOM standard ‘curation’ step to “shared_df”: Average replicate assays, remove duplicates and drop cases with large variance between replicates. mleqonly

Args:

targ_lst (list): A list of targets.

smiles_lst (list): A list of DataFrames.: These DataFrames must contain the columns gene_names, standard_type, standard_relation, standard_inchi_key, PIC50, and rdkit_smiles

shared_inchi_keys (list): A list of inchi keys used in this dataset.

Returns:

list, list:A list of curated DataFrames and a list of the number of compounds: dropped during the curation process for each target.

utils.data_curation_functions.atom_curation_excape(targ_lst, smiles_lst, shared_inchi_keys)[source]¶

Apply ATOM standard ‘curation’ step: Average replicate assays, remove duplicates and drop cases with large variance between replicates. Rows with NaN values in rdkit_smiles, VALUE_NUM_mean, and pXC50 are dropped

Args:

targ_lst (list): A list of targets.

smiles_lst (list): A of DataFrames.: These DataFrames must contain the columns gene_names, standard_type, standard_relation, standard_inchi_key, pXC50, and rdkit_smiles

shared_inchi_keys (list): A list of inchi keys used in this dataset.

Returns:

list:A list of curated DataFrames

utils.data_curation_functions.compute_negative_log_responses(df, unit_col='unit', value_col='value', new_value_col='average_col', relation_col=None, new_relation_col=None, unit_conv={'nM': <function <lambda>>, 'uM': <function <lambda>>}, inplace=False)[source]¶

Given the response values in value_col (IC50, Ki, Kd, etc.), compute their negative base 10 logarithms (pIC50, pKi, pKd, etc.) after converting them to molar units and store them in new_value_col. If relation_col is provided, replace any ‘<’ or ‘>’ relations with their opposites and store the result in new_relation_col (if provided), or in relation_col if note. Rows where the original value is 0 or negative will be dropped from the dataset.

Args:

df (DataFrame): A DataFrame that contains value_col, unit_col and relation_col.

unit_conv (dict): A dictionary mapping concentration units found in unit_col to functions that convert the corresponding concentrations to molar. The default handles micromolar and nanomolar units, represented as ‘uM’ and ‘nM’ respectively.

unit_col (str): Column containing units.

value_col (str): Column containing input values.

new_value_col (str): Column to receive converted values.

relation_col (str): Column containing relational operators for censored data.

new_relation_col (str): Column to receive inverted relations applicable to the negative log transformed values.

inplace (bool): If True, the input DataFrame is modified in place when possible. The default is to return a copy

Returns:

DataFrame: A table containing the transformed values and relations.

utils.data_curation_functions.convert_IC50_to_pIC50(df, unit_col='unit', value_col='value', new_value_col='average_col', relation_col=None, new_relation_col=None, unit_conv={'nM': <function <lambda>>, 'uM': <function <lambda>>}, inplace=False)[source]¶: For backward compatibiltiy only: equivalent to calling compute_negative_log_responses with the same arguments.

utils.data_curation_functions.down_select(df, kv_lst)[source]¶

Filters rows given a set of values

Given a DataFrame and a list of tuples columns (k) to values (v), this function filters out all rows where df[k] == v.

Args:

df (DataFrame): An input DataFrame.

kv_list (list): A list of tuples of (column, value)

Returns:

DataFrame: Rows where all df[k] == v

utils.data_curation_functions.exclude_organometallics(df, smiles_col='rdkit_smiles')[source]¶: Filters data frame df based on column smiles_col to exclude organometallic compounds

utils.data_curation_functions.filter_dtc_data(orig_df, geneNames)[source]¶

Extracts and post processes JAK1, 2, and 3 datasets from DTC

This is specific to the DTC database. Extract JAK1, 2 and 3 datasets from Drug Target Commons database, filtered for data usability. filter criteria:

gene_names == JAK1 | JAK2 | JAK3 InChi key not missing standard_type IC50 units NM standard_relation mappable to =, < or > wildtype_or_mutant != ‘mutated’ valid SMILES maps to valid RDKit base SMILES standard_value not missing pIC50 > 3

Args:

orig_df (DataFrame): Input DataFrame. Must contain the following columns: gene_names: standard_inchi_key, standard_type, standard_units, standard_value, compound_id, wildtype_or_mutant.

geneNames (list): A list of gene names to filter out of orig_df e.g. [‘JAK1’, ‘JAK2’].

Returns:

DataFrame: The filtered rows of the orig_df

utils.data_curation_functions.get_smiles_4dtc_data(nm_df, targ_lst, save_smiles_df)[source]¶

Returns SMILES strings from DTC data

nm_df must be a DataFrame from DTC with the following columns: gene_names, standard_type, standard_value, ‘standard_inchi_key’, and standard_relation.

This function selects all rows where nm_df[‘gene_names’] is in targ_lst, nm_df[‘standard_type’]==’IC50’, nm_df[‘standard_relation’]==’=’, and ‘standard_value’ > 0.

Then pIC50 values are calculated and added to the ‘PIC50’ column, and smiles strings are merged in from save_smiles_df

Args:

nm_df (DataFrame): Input DataFrame.

targ_lst (list): A list of targets.

save_smiles_df (DataFrame): A DataFrame with the column ‘standard_inchi_key’

Returns:

list, list, str: A list of smiles. A list of inchi keys shared between targets.: And a description of the targets

utils.data_curation_functions.get_smiles_dtc_data(nm_df, targ_lst, save_smiles_df)[source]¶

Returns SMILES strings from DTC data

nm_df must be a DataFrame from DTC with the following columns: gene_names, standard_type, standard_value, ‘standard_inchi_key’, and standard_relation.

This function selects all rows where nm_df[‘gene_names’] is in targ_lst, nm_df[‘standard_type’]==’IC50’, nm_df[‘standard_relation’]==’=’, and ‘standard_value’ > 0.

Then pIC50 values are calculated and added to the ‘PIC50’ column, and smiles strings are merged in from save_smiles_df

Args:

nm_df (DataFrame): Input DataFrame.

targ_lst (list): A list of targets.

save_smiles_df (DataFrame): A DataFrame with the column ‘standard_inchi_key’

Returns:

list, list: A list of smiles and a list of inchi keys shared between targets.

utils.data_curation_functions.get_smiles_excape_data(nm_df, targ_lst)[source]¶

Calculate base rdkit smiles

Divides up nm_df based on target and makes one DataFrame for each target.

Rows with NaN pXC50 values are dropped. Base rdkit SMILES are calculated from the SMILES column using atomsci.ddm.utils.struct_utils.base_smiles_from_smiles. A new column, ‘rdkit_smiles, is added to each output DataFrame.

Args:

nm_df (DataFrame): DataFrame for Excape database. Should contain the columns,: pXC50, SMILES, and Ambit_InchiKey

targ_lst (list): A list of targets to filter out of nm_df

Returns:

list, list: A list of DataFrames, one for each target, and a list of: all inchi keys used in the dataset.

utils.data_curation_functions.ic50topic50(x)[source]¶

Calculates pIC50 from IC50

Args:: x (float): An IC50 in nanomolar (nM) units.
Returns:: float: The pIC50.

utils.data_curation_functions.is_organometallic(mol)[source]¶: Returns True if the molecule is organometallic

utils.data_curation_functions.set_data_root(dir)[source]¶

Set global variables for data directories

Creates paths for DTC and Excape given a root data directory. Global variables ‘data_root’ and ‘data_dirs’. ‘data_root’ is the root data directory. ‘data_dirs’ is a dictionary that maps ‘DTC’ and ‘Excape’ to directores calcuated from ‘data_root’

Args:: dir (str): root data directory containing folds ‘dtc’ and ‘excape’
Returns:: None

utils.data_curation_functions.standardize_relations(dset_df, db=None, rel_col=None, output_rel_col=None, invert=False)[source]¶

Standardizes censoring operators

Standardize the censoring operators to =, < or >, and remove any rows whose operators don’t map to a standard one. There is a special case for db=’ChEMBL’ that strips the extra “‘“s around relationship symbols. Assumes relationship columns are ‘Standard Relation’, ‘standard_relation’ and ‘activity_prefix’ for ChEMBL, DTC and GoStar respectively.

This function makes the following mappings: “>” to “>”, “>=” to “>”, “<” to “<”, “<=” to “<”, and “=” to “=”. All other relations are removed from the DataFrame.

Args:

dset_df (DataFrame): Input DataFrame. Must contain either ‘Standard Relation’: or ‘standard_relation’

db (str): Source database. Must be either ‘GoStar’, ‘DTC’ or ‘ChEMBL’. Required if rel_col is not specified.

rel_col (str): Column containing relational operators. If specified, overrides the default relation column: for db.
output_rel_col (str): If specified, put the standardized operators in a new column with this name and leave: the original operator column unchanged.
invert (bool): If true, replace the inequality operators with their inverses. This is useful when a reported: value such as IC50 is converted to its negative log such as pIC50.

Returns:

DataFrame: Dataframe with the standardized relationship sybmols

utils.data_curation_functions.upload_df_dtc_base_smiles_all(dset_name, title, description, tags, functional_area, target, target_type, activity, assay_category, data_df, dtc_mleqonly_fileID, data_origin='journal', species='human', force_update=False)[source]¶

Uploads DTC base smiles data to the datastore

Uploads base SMILES string for the DTC dataset.

Returns the datastore OID of the uploaded dataset. The dataset is uploaded to the public bucket and lists https://doi.org/10.1016/j.chembiol.2017.11.009’ as the doi. This also assumes that the id_col is ‘compound_id’, the response column is set to PIC50, and the SMILES are assumed to be in ‘base_rdkit_smiles’.

Args:

dset_name (str): Name of the dataset. Should not include a file extension.

title (str): title of the file in (human friendly format)

description (str): long text box to describe file (background/use notes)

tags (list): Must be a list of strings.

functional_area (str): The functional area.

target (str): The target.

target_type (str): The target type of the dataset.

activity (str): The activity of the dataset.

assay_category (str): The assay category of the dataset.

data_df (DataFrame): DataFrame to be uploaded.

dtc_mleqonly_fileID (str): Source file id used to generate data_df.

data_origin (str): The origin of the dataset e.g. journal.

species (str): The species of the dataset e.g. human, rat, dog.

force_update (bool): Overwrite existing datasets in the datastore.

Returns:

str: datastore OID of the uploaded dataset.

utils.data_curation_functions.upload_df_dtc_mleqonly(dset_name, title, description, tags, functional_area, target, target_type, activity, assay_category, data_df, dtc_smiles_fileID, data_origin='journal', species='human', force_update=False)[source]¶

Uploads DTC mleqonly data to the datastore

Upload mleqonly data to the datastore from the given DataFrame. The DataFrame must contain the column ‘rdkit_smiles’ and ‘VALUE_NUM_mean’. This function is meant to upload data that has been aggregated using atomsci.ddm.utils.curate_data.average_and_remove_duplicates. Returns the datastore OID of the uploaded dataset. The dataset is uploaded to the public bucket and lists https://doi.org/10.1016/j.chembiol.2017.11.009’ as the doi. This also assumes that the id_col is ‘compound_id’.

Args:

dset_name (str): Name of the dataset. Should not include a file extension.

title (str): title of the file in (human friendly format)

description (str): long text box to describe file (background/use notes)

tags (list): Must be a list of strings.

functional_area (str): The functional area.

target (str): The target.

target_type (str): The target type of the dataset.

activity (str): The activity of the dataset.

assay_category (str): The assay category of the dataset.

data_df (DataFrame): DataFrame to be uploaded.

dtc_smiles_fileID (str): Source file id used to generate data_df.

data_origin (str): The origin of the dataset e.g. journal.

species (str): The species of the dataset e.g. human, rat, dog.

force_update (bool): Overwrite existing datasets in the datastore.

Returns:

str: datastore OID of the uploaded dataset.

utils.data_curation_functions.upload_df_dtc_mleqonly_class(dset_name, title, description, tags, functional_area, target, target_type, activity, assay_category, data_df, dtc_mleqonly_fileID, data_origin='journal', species='human', force_update=False)[source]¶

Uploads DTC mleqonly classification data to the datastore

Upload mleqonly classification data to the datastore from the given DataFrame. The DataFrame must contain the column ‘rdkit_smiles’ and ‘binary_class’. This function is meant to upload data that has been aggregated using atomsci.ddm.utils.curate_data.average_and_remove_duplicates and then thresholded to make a binary classification dataset. Returns the datastore OID of the uploaded dataset. The dataset is uploaded to the public bucket and lists https://doi.org/10.1016/j.chembiol.2017.11.009’ as the doi. This also assumes that the id_col is ‘compound_id’.

Args:

dset_name (str): Name of the dataset. Should not include a file extension.

title (str): title of the file in (human friendly format)

description (str): long text box to describe file (background/use notes)

tags (list): Must be a list of strings.

functional_area (str): The functional area.

target (str): The target.

target_type (str): The target type of the dataset.

activity (str): The activity of the dataset.

assay_category (str): The assay category of the dataset.

data_df (DataFrame): DataFrame to be uploaded.

dtc_mleqonly_fileID (str): Source file id used to generate data_df.

data_origin (str): The origin of the dataset e.g. journal.

species (str): The species of the dataset e.g. human, rat, dog.

force_update (bool): Overwrite existing datasets in the datastore.

Returns:

str: datastore OID of the uploaded dataset.

utils.data_curation_functions.upload_df_dtc_smiles(dset_name, title, description, tags, functional_area, target, target_type, activity, assay_category, smiles_df, orig_fileID, data_origin='journal', species='human', force_update=False)[source]¶

Uploads DTC smiles data to the datastore

Upload a raw dataset to the datastore from the given DataFrame. Returns the datastore OID of the uploaded dataset. The dataset is uploaded to the public bucket and lists https://doi.org/10.1016/j.chembiol.2017.11.009’ as the doi. This also assumes that the id_col is ‘compound_id’

Args:

dset_name (str): Name of the dataset. Should not include a file extension.

title (str): title of the file in (human friendly format)

description (str): long text box to describe file (background/use notes)

tags (list): Must be a list of strings.

functional_area (str): The functional area.

target (str): The target.

target_type (str): The target type of the dataset.

activity (str): The activity of the dataset.

assay_category (str): The assay category of the dataset.

smiles_df (DataFrame): DataFrame containing SMILES to be uploaded.

orig_fileID (str): Source file id used to generate smiles_df.

data_origin (str): The origin of the dataset e.g. journal.

species (str): The species of the dataset e.g. human, rat, dog.

force_update (bool): Overwrite existing datasets in the datastore.

Returns:

str: datastore OID of the uploaded dataset.

utils.data_curation_functions.upload_df_dtc_smiles_regr_all_class(dset_name, title, description, tags, functional_area, target, target_type, activity, assay_category, data_df, dtc_smiles_regr_all_fileID, smiles_column, data_origin='journal', species='human', force_update=False)[source]¶

Uploads DTC classification data to the datastore

Uploads binary classiciation data for the DTC dataset. Classnames are assumed to be ‘active’ and ‘inactive’

Returns the datastore OID of the uploaded dataset. The dataset is uploaded to the public bucket and lists https://doi.org/10.1016/j.chembiol.2017.11.009’ as the doi. This also assumes that the id_col is ‘compound_id’, the response column is set to PIC50.

Args:

dset_name (str): Name of the dataset. Should not include a file extension.

title (str): title of the file in (human friendly format)

description (str): long text box to describe file (background/use notes)

tags (list): Must be a list of strings.

functional_area (str): The functional area.

target (str): The target.

target_type (str): The target type of the dataset.

activity (str): The activity of the dataset.

assay_category (str): The assay category of the dataset.

data_df (DataFrame): DataFrame to be uploaded.

dtc_smiles_regr_all_fileID(str): Source file id used to generate data_df.

smiles_column (str): Column containing SMILES.

data_origin (str): The origin of the dataset e.g. journal.

species (str): The species of the dataset e.g. human, rat, dog.

force_update (bool): Overwrite existing datasets in the datastore.

Returns:

str: datastore OID of the uploaded dataset.

utils.data_curation_functions.upload_df_excape_mleqonly(dset_name, title, description, tags, functional_area, target, target_type, activity, assay_category, data_df, smiles_fileID, data_origin='journal', species='human', force_update=False)[source]¶

Uploads Excape mleqonly data to the datastore

Upload mleqonly to the datastore from the given DataFrame. Returns the datastore OID of the uploaded dataset. The dataset is uploaded to the public bucket and lists https://dx.doi.org/10.1186%2Fs13321-017-0203-5 as the doi. This also assumes that the id_col is ‘Original_Entry_ID’, smiles_col is ‘rdkit_smiles’ and response_col is ‘VALUE_NUM_mean’.

Args:

dset_name (str): Name of the dataset. Should not include a file extension.

title (str): title of the file in (human friendly format)

description (str): long text box to describe file (background/use notes)

tags (list): Must be a list of strings.

functional_area (str): The functional area.

target (str): The target.

target_type (str): The target type of the dataset.

activity (str): The activity of the dataset.

assay_category (str): The assay category of the dataset.

data_df (DataFrame): DataFrame containing SMILES to be uploaded.

smiles_fileID (str): Source file id used to generate data_df.

data_origin (str): The origin of the dataset e.g. journal.

species (str): The species of the dataset e.g. human, rat, dog.

force_update (bool): Overwrite existing datasets in the datastore.

Returns:

str: datastore OID of the uploaded dataset.

utils.data_curation_functions.upload_df_excape_mleqonly_class(dset_name, title, description, tags, functional_area, target, target_type, activity, assay_category, data_df, mleqonly_fileID, data_origin='journal', species='human', force_update=False)[source]¶

Uploads Excape mleqonly classification data to the datastore

data_df contains a binary classification dataset with ‘active’ and ‘incative’ classes.

Upload mleqonly classification to the datastore from the given DataFrame. Returns the datastore OID of the uploaded dataset. The dataset is uploaded to the public bucket and lists https://dx.doi.org/10.1186%2Fs13321-017-0203-5 as the doi. This also assumes that the id_col is ‘Original_Entry_ID’, smiles_col is ‘rdkit_smiles’ and response_col is ‘binary_class’.

Args:

dset_name (str): Name of the dataset. Should not include a file extension.

title (str): title of the file in (human friendly format)

description (str): long text box to describe file (background/use notes)

tags (list): Must be a list of strings.

functional_area (str): The functional area.

target (str): The target.

target_type (str): The target type of the dataset.

activity (str): The activity of the dataset.

assay_category (str): The assay category of the dataset.

data_df (DataFrame): DataFrame containing SMILES to be uploaded.

mleqonly_fileID (str): Source file id used to generate data_df.

data_origin (str): The origin of the dataset e.g. journal.

species (str): The species of the dataset e.g. human, rat, dog.

force_update (bool): Overwrite existing datasets in the datastore.

Returns:

str: datastore OID of the uploaded dataset.

utils.data_curation_functions.upload_df_excape_smiles(dset_name, title, description, tags, functional_area, target, target_type, activity, assay_category, smiles_df, orig_fileID, data_origin='journal', species='human', force_update=False)[source]¶

Uploads Excape SMILES data to the datastore

Upload SMILES to the datastore from the given DataFrame. Returns the datastore OID of the uploaded dataset. The dataset is uploaded to the public bucket and lists https://dx.doi.org/10.1186%2Fs13321-017-0203-5 as the doi. This also assumes that the id_col is ‘Original_Entry_ID’

Args:

dset_name (str): Name of the dataset. Should not include a file extension.

title (str): title of the file in (human friendly format)

description (str): long text box to describe file (background/use notes)

tags (list): Must be a list of strings.

functional_area (str): The functional area.

target (str): The target.

target_type (str): The target type of the dataset.

activity (str): The activity of the dataset.

assay_category (str): The assay category of the dataset.

smiles_df (DataFrame): DataFrame containing SMILES to be uploaded.

orig_fileID (str): Source file id used to generate smiles_df.

data_origin (str): The origin of the dataset e.g. journal.

species (str): The species of the dataset e.g. human, rat, dog.

force_update (bool): Overwrite existing datasets in the datastore.

Returns:

str: datastore OID of the uploaded dataset.

utils.data_curation_functions.upload_file_dtc_raw_data(dset_name, title, description, tags, functional_area, target, target_type, activity, assay_category, file_path, data_origin='journal', species='human', force_update=False)[source]¶

Uploads raw DTC data to the datastore

Upload a raw dataset to the datastore from the given DataFrame. Returns the datastore OID of the uploaded dataset. The dataset is uploaded to the public bucket and lists https://doi.org/10.1016/j.chembiol.2017.11.009’ as the doi. This also assumes that the id_col is ‘compound_id’

Args:

dset_name (str): Name of the dataset. Should not include a file extension.

title (str): title of the file in (human friendly format)

description (str): long text box to describe file (background/use notes)

tags (list): Must be a list of strings.

functional_area (str): The functional area.

target (str): The target.

target_type (str): The target type of the dataset.

activity (str): The activity of the dataset.

assay_category (str): The assay category of the dataset.

file_path (str): The filepath of the dataset.

data_origin (str): The origin of the dataset e.g. journal.

species (str): The species of the dataset e.g. human, rat, dog.

force_update (bool): Overwrite existing datasets in the datastore.

Returns:

str: datastore OID of the uploaded dataset.

utils.data_curation_functions.upload_file_dtc_smiles_regr_all(dset_name, title, description, tags, functional_area, target, target_type, activity, assay_category, file_path, dtc_smiles_fileID, smiles_column, data_origin='journal', species='human', force_update=False)[source]¶

Uploads regression DTC data to the datastore

Uploads regression dataset for DTC dataset.

Returns the datastore OID of the uploaded dataset. The dataset is uploaded to the public bucket and lists https://doi.org/10.1016/j.chembiol.2017.11.009’ as the doi. This also assumes that the id_col is ‘compound_id’, the response column is set to PIC50.

Args:

dset_name (str): Name of the dataset. Should not include a file extension.

title (str): title of the file in (human friendly format)

description (str): long text box to describe file (background/use notes)

tags (list): Must be a list of strings.

functional_area (str): The functional area.

target (str): The target.

target_type (str): The target type of the dataset.

activity (str): The activity of the dataset.

assay_category (str): The assay category of the dataset.

data_df (DataFrame): DataFrame to be uploaded.

dtc_smiles_fileID(str): Source file id used to generate data_df.

smiles_column (str): Column containing SMILES.

data_origin (str): The origin of the dataset e.g. journal.

species (str): The species of the dataset e.g. human, rat, dog.

force_update (bool): Overwrite existing datasets in the datastore.

Returns:

str: datastore OID of the uploaded dataset.

utils.data_curation_functions.upload_file_excape_raw_data(dset_name, title, description, tags, functional_area, target, target_type, activity, assay_category, file_path, data_origin='journal', species='human', force_update=False)[source]¶

Uploads raw Excape data to the datastore

Upload a raw dataset to the datastore from the given DataFrame. Returns the datastore OID of the uploaded dataset. The dataset is uploaded to the public bucket and lists https://dx.doi.org/10.1186%2Fs13321-017-0203-5 as the doi. This also assumes that the id_col is ‘Original_Entry_ID’

Args:

dset_name (str): Name of the dataset. Should not include a file extension.

title (str): title of the file in (human friendly format)

description (str): long text box to describe file (background/use notes)

tags (list): Must be a list of strings.

functional_area (str): The functional area.

target (str): The target.

target_type (str): The target type of the dataset.

activity (str): The activity of the dataset.

assay_category (str): The assay category of the dataset.

file_path (str): The filepath of the dataset.

data_origin (str): The origin of the dataset e.g. journal.

species (str): The species of the dataset e.g. human, rat, dog.

force_update (bool): Overwrite existing datasets in the datastore.

Returns:

str: datastore OID of the uploaded dataset.

utils.datastore_functions module¶

This file contains functions to make it easier to browse and retrieve data from the datastore. Intended for general use. Add/modify functions as needed. Created 23Jul18 CHW

utils.datastore_functions.bulk_export_kv_for_files(files, save_as, client=None)[source]¶

exports a csv file with 3 columns: bucket, dataset_key, key/value pairs to make reviewing metadata easier

Args:

files (list of tuples): format [(bucket1, dataset_key1), (bucket2, dataset_key2)]

save_as (str): filename to use for new file

Returns:

None

utils.datastore_functions.bulk_update_kv(file, client=None, i=0)[source]¶: this function allows you to upload a properly formatted csv file with 4 columns (order and spelling of headings must match): bucket, dataset_key, kv_add, kv_del the metadata for the files listed will then be updated in the Datastore

utils.datastore_functions.check_key_val(key_values, client=None, df=None, enforced=True)[source]¶

Checks to ensure the keys and values specified are ‘approved’ and that (optionally) all required keys are filled out.

Args:

key_values (dict): keys and values specified by user for a file

client (optional): set client if not using the default

df (DataFrame): dataframe to be uploaded

enforced (bool, optional): If True (default) checks that all required keys are filled out

Returns:

(bool): returns True if all keys and values are ‘approved’ AND enforcement criteria are met

utils.datastore_functions.config_client(token=None, url='https://twintron-blue.llnl.gov/atom/datastore/api/v1.0/swagger.json', new_instance=False)[source]¶

Configures client to access datastore service.

Args:

token (str): Path to file containing token for accessing datastore. Defaults to /usr/local/data/ds_token.txt on non-LC systems, or to $HOME/data/ds_token.txt on LC systems.

url (str): URL for datastore REST service.

new_instance (bool): True to force creation of a new client object. By default, a shared singleton object is returned.

Returns:

returns configured client

utils.datastore_functions.copy_datasets_to_bucket(dataset_keys, from_bucket, to_bucket, ds_client=None)[source]¶

Copy each named dataset from one bucket to another.

Args:

dataset_keys (str or list of str): List of dataset_keys for datasets to move.

from_bucket (str): Bucket where datasets are now.

to_bucket (str): Bucket to move datasets to.

Returns:

None

utils.datastore_functions.dataset_key_exists(dataset_key, bucket, client=None)[source]¶

Returns a boolean indicating whether the given dataset_key is already present in the bucket specified.

Args:

dataset_key (str): the dataset_key for the dataset you want (unique in each bucket)

bucket (str): the bucket the dataset you want resides in

client (optional): set client if not using the default

Returns:

(bool): returns ‘True’ if dataset_key is present in bucket specified

utils.datastore_functions.filter_datasets_interactive(bucket='all', client=None, save_search=False, restrict_key=True, restrict_value=False, dataset_oid_only=False, display_all_columns=False, max_rows=10)[source]¶

This is an old way of searching for files. Not based on the current format. Only use

Args:

bucket (str or list, optional): buckets to search (defaults to searching all buckets you have access to in the datastore)

client (optional): set client if not using the default

restrict_key (bool, optional): if set to True, restricts the search to keys that are on the approved list (see file in bucket with dataset_key: accepted_key_values)

restrict_key (bool, optional): if set to True, restricts the search to values that are on the approved list (see file in bucket with dataset_key: accepted_key_values)

dataset_oid_only (bool, optional): if True, return a list of dataset_oids meeting the criteria; if False, returns a dataframe of all the metadata for the files meeting search criteria

display_all_columns (bool, optional): If ‘False’ (default), then show only a selected subset of the columns

max_rows (int, optional): maximum rows to display during interactive search

Returns:

None

utils.datastore_functions.get_key_val(metadata, key=None)[source]¶

Simple utility to search through list of key value pairs and return values for query key

Args:

metadata list of key,value pairs (list): a list with position 0 = string/list of buckets, and remaining positions dictionaries of search criteria

example:
[{‘key’: ‘species’, ‘value’: [‘rat’] }, {‘key’: ‘assay_category’,’value’: [‘solubility’, ‘volume_of_distribution’]}]

key (str): key to search for

Returns:

returns When key is provide, returns value for matching key if found, None otherwise returns dictionary for the list of key,value pairs when a query key is not provided.

utils.datastore_functions.get_keyval(dataset_oid=None, dataset_key=None, bucket=None, client=None)[source]¶: Requires either dataset_oid or dataset_key+bucket. Function extracts the key:value pairs and converts from the ‘datastore format’ (list of dictionaries) into ‘model tracker format’ (a single dictionary).

utils.datastore_functions.initialize_model_tracker(new_instance=False)[source]¶

Create or obtain a client object for the model tracker service..

Returns:: mlmt_client (MLMTClientSingleton): The client object for the model tracker service.

utils.datastore_functions.key_exists(key, bucket='all', client=None)[source]¶

Check if key exists in bucket(s) specified.

Args:

key (str): the key of interest

bucket (str or list, optional): ‘all’ by default. Specify bucket (as a str or list) to limit search

client (optional): set client if not using the default

Returns:

(bool): Returns True if key exists in bucket(s) specified

utils.datastore_functions.list_key_values(bucket, input_key, category='experimental', client=None)[source]¶

List the values for input key. Requires that the input key be in the ‘approved’ list

Args:

bucket (str or list, optional): buckets to search (defaults to searching all buckets you have access to in the datastore)

input_key: user specified key to query

category: ‘experimental’ or ‘pdb_bind’

client (optional): set client if not using the default

Returns:

None

utils.datastore_functions.repeat_defined_search(defined_search, client=None, to_return='df', display_all_columns=False)[source]¶

Retrieves a DataFrame of files (and associated metadata) meeting the search criteria. This is designed to work well with the output from the filter_datasets_interactive function with defined_search=True

Args:

defined_search (list): a list with position 0 = string/list of buckets, and remaining positions dictionaries of search criteria

example: defined_search = [‘gsk_ml’,: {‘key’: ‘species’, ‘value’: [‘rat’], ‘operator’: ‘in’}, {‘key’: ‘assay_category’,’value’: [‘solubility’, ‘volume_of_distribution’], ‘operator’: ‘in’}]

client (optional): set client if not using the default to_return (str, optional): (default=df)

‘df’ (df_results) = return a pandas dataframe summarizing metadata of files meeting criteria ‘oid’ (dataset_oid) = return a list of dataset_oids meeting criteria ‘ds_key’ (dataset_key) = return a list of dataset_key + bucket tuples

display_all_column (bool, optional): default False. If True, displays all associated metadata instead of just a selected subset

Returns:

One of the following will be returned (based on selection for ‘to_return’) (DataFrame): dataframe of metadata for the files matching the criteria specified in the search (list): list of dataset_oids meeting the criteria specified in the search (list): list of bucket and dataset_key meeting the criteria specified in the search

utils.datastore_functions.retrieve_bucket_names(client=None)[source]¶

Retrieve a list of the bucket names in datastore

Args:: client (optional): set client if not using the default
Returns:: (list): list of bucket names that exist in the datastore which user has access to

utils.datastore_functions.retrieve_columns_from_dataset(bucket, dataset_key, client=None, max_rows=0, column_names='', return_names=False)[source]¶

Retrieve column(s) from csv file (may be bz2 compressed) in datastore.

‘NA’ returned if column not in file (as well as warning message).

Args:

return_names (bool): If true, just return column headers from file

max_rows (int): default=0 which will return all rows

client (optional): set client if not using the default

Returns:

(dict): dictionary corresponding to selected columns

utils.datastore_functions.retrieve_dataset_by_dataset_oid(dataset_oid, client=None, return_metadata=False, nrows=None, print_metadata=False, sep=False, index_col=None, tarpath='.')[source]¶

retrieves the dataset and returns as a pandas dataframe (or other format as needed depending on file type).

Args:

dataset_oid (str): unique identifier for the dataset you want

client (optional): set client if not using the default

return_metadata (bool, optional): if set to True, return a dictionary of the metadata INSTEAD of a dataframe of the data

nrows (num, optional): used to limit the number of rows returned

print_metadata (bool, optional): if set to True, displays the document metadata/properties

sep (str, optional): separator used for csv file

tarpath (str, optional): path to use for tarball files

index_col (int, optional): For csv files, column to use as the row labels of the DataFrame

Returns:

(DataFrame, OrderedDict, str, dict): filetype determines what type of object is returned. xls and xlsx files returns an OrderedDict. tarball (gz and tgz) files returns the location of the files as a string csv returns a DataFrame

optionally, return a dictionary of the metadata only if ‘return_metadata’ is set to TRUE.

utils.datastore_functions.retrieve_dataset_by_datasetkey(dataset_key, bucket, client=None, return_metadata=False, nrows=None, print_metadata=False, sep=False, index_col=None, tarpath='.', **kwargs)[source]¶

Retrieves the dataset and returns as a pandas dataframe (or other format as needed depending on file type).

Args:

dataset_key (str): the dataset_key for the dataset you want (unique in each bucket)

bucket (str): the bucket the dataset you want resides in

client (optional): set client if not using the default

return_metadata (bool, optional): if set to True, return a dictionary of the metadata INSTEAD of a dataframe of the data

nrows (num, optional): used to limit the number of rows returned

print_metadata (bool, optional): if set to True, displays the document metadata/properties

sep (str, optional): separator used for csv file

tarpath (str, optional): path to use for tarball files

index_col (int, optional): For csv files, column to use as the row labels of the DataFrame

Returns:

(DataFrame, OrderedDict, str, dict): filetype determines what type of object is returned. xls and xlsx files returns an OrderedDict. tarball (gz and tgz) files returns the location of the files as a string csv returns a DataFrame

optionally, return a dictionary of the metadata only if ‘return_metadata’ is set to TRUE.

utils.datastore_functions.retrieve_keys(bucket='all', client=None, sort=True)[source]¶

Get a list of keys in bucket(s) specified.

Args:

bucket (str, optional): ‘all’ by default. Specify bucket (as a str or list) to limit search

client (optional): set client if not using the default

sort (bool, optional): if ‘True’ (default), sort the keys alphabetically

Returns:

(list): returns a list of keys in bucket(s) specified

utils.datastore_functions.retrieve_values_for_key(key, bucket='all', client=None)[source]¶

Get a list of values associated with a specified key.

Args:

key (str): the key of interest

bucket (str or list, optional): ‘all’ by default. Specify bucket (as a str or list) to limit search

client (optional): set client if not using the default

Returns:

(list): Returns a list of values (str) associated with a specified key

utils.datastore_functions.search_datasets_by_key_value(key, value, client=None, operator='in', bucket='all', display_all_columns=False)[source]¶

Find datasets by key:value pairs and returns a DataFrame of datasets and associated properties.

Args:

key (str): the key of interest

value (str): the value of interest

client (optional): set client if not using the default

operator (str, optional): ‘in’ by default, but can be changed to any of the following:: =, !=, <, <=, >, >=, all, in, not in

bucket (str or list, optional): ‘all’ by default. Specify bucket (as a str or list) to limit search

display_all_columns (bool, optional): If ‘False’ (default), then show only a selected subset of the columns

Returns:

(DataFrame): summary table of the files and relevant metadata matching the criteria specified

utils.datastore_functions.search_files_interactive(bucket='all', client=None, to_return='df', display_all_columns=False, max_rows=10)[source]¶

This tool helps you find the files you need via an interactive/guided interface.

Args:

bucket (str or list, optional): Buckets to search (defaults to searching all buckets you have access to in the datastore). client (optional): Set client if not using the default. to_return (str): ‘df’ (df_results) = return a pandas dataframe summarizing metadata of files meeting criteria,

‘search’ (search_criteria) = return a list containing search criteria where position 0 = string/list of buckets, and remaining positions are dictionaries of search criteria. Designed to work with ‘repeat_defined_search’ function. ‘oid’ (dataset_oid) = return a list of dataset_oids meeting criteria, ‘ds_key’ (dataset_key) = return a list of dataset_key + bucket tuples.

display_all_columns (bool, optional): If ‘False’ (default), then show only a selected subset of the columns. max_rows (int, optional): Maximum rows to display during interactive search.

Returns:

None

utils.datastore_functions.string_to_dict(dict_string)[source]¶

utils.datastore_functions.string_to_list(list_string)[source]¶

utils.datastore_functions.summarize_datasets(dataset_keys, bucket, client=None, column=None, save_as=None, plot_ht=10, labels=None, last=False)[source]¶

Generate summary statistics such as min/max/median/mean on files specified (all files must be in same bucket).

Args:

dataset_keys (list): dataset_keys corresponding to the files to summarize

bucket (str): bucket the files reside in

client (optional): set client if not using the default

column (str, optional): column to summarize (will be prompted to specify if not pre-specified or if column does not exist in file)

save_as (str, optional): filename to save image of box plot(s) to

plot_ht (int, optional): height of box plots (default = 10)

labels (‘str’, optional):

last (bool optional): If True (default=False), then summarize values from last column instead of specifying column heading

Returns

(DataFrame): returns table summarizing the stats for the file(s) specified

utils.datastore_functions.update_distribution_kv(bucket, dataset_key, client=None, kv_add=None, kv_del=None, return_metadata=False)[source]¶

update the key:values for specified file. No change to file.

Args:

bucket (str): Specify bucket where the file exists

dataset_key (str): dataset_key for the file to update metadata for

client (optional): set client if not using the default

kv_add (dict, optional): key-value pairs to add to the metadata for the file specified

kv_del (str or list, optional): keys to delete from the metadata for the file specified

Returns:

None

utils.datastore_functions.update_kv(bucket, dataset_key, client=None, kv_add=None, kv_del=None, return_metadata=False)[source]¶

update the key:values for specified file. No change to file.

Args:

bucket (str): Specify bucket where the file exists

dataset_key (str): dataset_key for the file to update metadata for

client (optional): set client if not using the default

kv_add (dict, optional): key-value pairs to add to the metadata for the file specified

kv_del (str or list, optional): keys to delete from the metadata for the file specified

Returns:

None

utils.datastore_functions.upload_df_to_DS(df, bucket, filename, title, description, tags, key_values, client=None, dataset_key=None, override_check=True, return_metadata=False, index=False, data_type=None)[source]¶

This function will upload a file to the Datastore along with the associated metadata

Args:

df (DataFrame): dataframe to be uploaded

bucket (str): bucket the file will be put in

filename (str): the filename to save the dataframe as in the datastore. Include the extension

title (str): title of the file in (human friendly format)

description (str): long text box to describe file (background/use notes)

tags (list): must be a list.

key_values (dict): key-value pairs to enable future users to find the file. Must be a dictionary.

client (optional): set client if not using the default

dataset_key (str): If updating a file already in the datastore enter the corresponding dataset_key. If not, leave as ‘none’ and the dataset_key will be automatically generated.

data_type (str,optional): Specify dataType (e.g. csv,bz, etc) if not specified attempt to use file extension

Returns:

(dict): if return_metadata=True, then function returns a dictionary of the metadata for the uploaded dataset.

utils.datastore_functions.upload_file_to_DS(bucket, title, description, tags, key_values, filepath, filename, client=None, dataset_key=None, override_check=True, return_metadata=False, file_ref=False, data_type=None)[source]¶

This function will upload a file to the Datastore along with the associated metadata

Args:

bucket (str): bucket the file will be put in

title (str): title of the file in (human friendly format)

description (str): long text box to describe file (background/use notes)

tags (list): must be a list.

key_values (dict): key:value pairs to enable future users to find the file. Must be a dictionary.

filepath (str): current location of the file

filename (str): current filename of the file

client (optional): set client if not using the default

dataset_key (str, optional): If updating a file already in the datastore enter the corresponding dataset_key. If not, leave as ‘none’ and the dataset_key will be automatically generated.

override_check (bool, optional): If ‘True’ then do NOT perform a check of the keys/values against approved list and enforcement criteria

return_metadata (bool, optional): If ‘True’ (default=False), then return the metadata from the uploaded file

file_ref (bool, optional): If ‘True’ (default=False), links file to the datastore instead of creating a copy to managed by the datastore.

data_type (str,optional): Specify dataType (e.g. csv,bz, etc) if not specified attempt to use file extension

Returns:

(dict): optionally returns the metadata from the uploaded file (if return_metadata=True)

utils.datastore_functions.upload_pickle_to_DS(data, bucket, filename, title, description, tags, key_values, client=None, dataset_key=None, override_check=True, return_metadata=False)[source]¶

This function will upload a file to the Datastore along with the associated metadata.

Args:

data (DataFrame, str, list, tuple, pickle): data to be pickled and uploaded

bucket (str): bucket the file will be put in

filename (str): the filename to save the dataframe as in the datastore. Include the extension

title (str): title of the file in (human friendly format)

description (str): long text box to describe file (background/use notes)

tags (list): must be a list.

key_values (dict): key:value pairs to enable future users to find the file. Must be a dictionary.

client (optional): set client if not using the default

dataset_key (str, optional): If updating a file already in the datastore enter the corresponding dataset_key.: If not, leave as ‘none’ and the dataset_key will be automatically generated.

override_check (bool, optional): If True, overrides checking the metadata for the file when uploaded.

return_metadata (bool, optional): If True, returns metadata for the file after it is uploaded.

Returns:

None

utils.generate_transformers module¶

utils.generate_transformers.build_and_save_feature_transformers_from_csvs(transformer_dataset_key_configs, dest_pkl_path, featurizer, descriptor_type, feature_transform_type, **kwargs)[source]¶

Build feature transformers_x from a list of csv files (or csv+split_uuid tuples) and save them as a pickle file, including the params object used to create them. This function saves feature transformers suitable for use for one fold.

Args:: transformer_dataset_key_configs (list): A list of dictionaries that contain information about each dataset_key such as id_col, smiles_col, response_cols, split_uuids, etc. dest_pkl_path (str): Path to save the pickle file. featurizer (str): The featurizer type (e.g., ‘ecfp’, ‘graphconv’, ‘computed_descriptors’, etc.). descriptor_type (str): Descriptor type (e.g., ‘moe’, ‘rdkit_raw’, etc.). feature_transform_type (str): The type of transformer to use (e.g., ‘RobustScaler’, ‘PowerTransformer’, etc.). **kwargs: Additional keyword arguments for transformer params.
Returns:: None

utils.generate_transformers.filter_outlier_MW(dataset_key, smiles_col, threshold=1000, workers=8)[source]¶

Filters datasets and looks for compounds with very large molecular weights.

Args:: dataset_key (str): Path to the dataset CSV file. smiles_col (str): Name of the column containing SMILES strings. threshold (float): Threshold for filtering large molecular weights. workers (int): Number of workers to use for parallel processing in calculating molecular weights. workers (int): Number of workers to use for parallel processing in calculating molecular weights.
Returns:: List of SMILES with molecular weights that exceed the threshold.

utils.generate_transformers.filter_outlier_features(dataset_key, id_col, smiles_col, response_cols, featurizer, descriptor_type, threshold=10000000000.0)[source]¶

Looks for compounds with very large descriptor values.

Args:: dataset_key_configs (list): List of dataset key configuration dictionaries. featurizer (str): The featurizer type (e.g., ‘ecfp’, ‘graphconv’, ‘computed_descriptors’, etc.). descriptor_type (str): Descriptor type (e.g., ‘moe’, ‘rdkit_raw’, etc.). threshold (float): Threshold for filtering large descriptor values.
Returns:: DataFrame with outlier compound_ids, descriptor column names, and descriptor values that exceed the threshold.

utils.generate_transformers.load_all_datasets(transformer_dataset_key_configs, featurizer, descriptor_type)[source]¶

Loads datasets from configs and builds NumpyDataset

Args:: csvs_or_tuples (list): List of csv file paths or (csv_file, split_uuid) tuples. featurizer (str): The featurizer type (e.g., ‘ecfp’, ‘graphconv’, ‘computed_descriptors’, etc.). descriptor_type (str): Descriptor type (e.g., ‘moe’, ‘rdkit_raw’, etc.).
Returns:: NumpyDataset

utils.generate_transformers.prepare_csv_and_descriptor_with_dummy_response(csv_path, descriptor_type, temp_root, split_uuid='split_uuid')[source]¶

Copies the csv file and its descriptor file to a temp directory, preserving structure, and adds a ‘dummy_response’ column of zeros to both.

Args:: csv_path (str): Path to the original CSV file. descriptor_type (str): Descriptor type to look for in the descriptor file name. temp_root (str): Root of the temporary directory to copy files into. split_uuid (str): Unique identifier for the split file to look for and copy if it exists.
Returns:: (str, str): Paths to the new CSV and descriptor files in the temp directory.

utils.hyperparam_search_wrapper module¶

Script to generate hyperparameter combinations based on input params and send off jobs to a slurm system. Author: Amanda Minnich

class utils.hyperparam_search_wrapper.GeometricSearch(params)[source]¶

Bases: HyperparameterSearch

Generates parameter values in logistic steps, rather than linear like GridSearch does

generate_assay_list()[source]¶

Generates the list of datasets to build models for, with their key, bucket, split, and split uuid

Returns:: None

generate_combo(params_dict)[source]¶

Method to generate all combinations from a given set of key-value pairs

Args:: params_dict: Set of key-value pairs with the key being the param name and the value being the list of values you want to try for that param
Returns:: new_dict: The list of all combinations of parameters

generate_param_combos()[source]¶

Performs additional parsing of parameters and generates all combinations

Returns:: None

split_and_save_dataset(assay_params)[source]¶

Splits a given dataset, saves it, and sets the split_uuid in the metadata

Args:: assay_params: Dataset metadata
Returns:: None

class utils.hyperparam_search_wrapper.GridSearch(params)[source]¶

Bases: HyperparameterSearch

Generates fixed steps on a grid for a given hyperparameter range

generate_assay_list()[source]¶

Generates the list of datasets to build models for, with their key, bucket, split, and split uuid

Returns:: None

generate_combo(params_dict)[source]¶

Method to generate all combinations from a given set of key-value pairs

Args:: params_dict: Set of key-value pairs with the key being the param name and the value being the list of values you want to try for that param
Returns:: new_dict: The list of all combinations of parameters

generate_param_combos()[source]¶

Performs additional parsing of parameters and generates all combinations

Returns:: None

split_and_save_dataset(assay_params)[source]¶

Splits a given dataset, saves it, and sets the split_uuid in the metadata

Args:: assay_params: Dataset metadata
Returns:: None

class utils.hyperparam_search_wrapper.HyperparameterSearch(params)[source]¶

Bases: object

The class for generating and running all hyperparameter combinations based on the input params given

already_run(assay_params, retry_time=10)[source]¶

Checks to see if a model with a given metadata combination has already been built

Args:: assay_params: model metadata information
Returns:: Boolean specifying if model has been previously built

assemble_layers()[source]¶

Reformats layer parameters

Returns:: None

build_jobs()[source]¶

Builds jobs. Reformats parameters as necessary

Returns:: None

filter_jobs(job_list)[source]¶

Removes jobs that should not be run

Returns:: None

generate_assay_list()[source]¶

Generates the list of datasets to build models for, with their key, bucket, split, and split uuid

Returns:: None

generate_combo(params_dict)[source]¶: This is implemented in the specific sub-classes

generate_combos(params_dict)[source]¶

Calls sub-function generate_combo and then uses itertools.product to generate all desired combinations

Args:: params_dict:
Returns:: None

generate_maestro_commands()[source]¶

Generates commands that can be used by maestro

Generates a list of commands that can be put directly into the shell to run model training.

Args:: None
Returns:: list: A list of shell commands

generate_param_combos()[source]¶

Performs additional parsing of parameters and generates all combinations

Returns:: None

generate_searches()[source]¶

Generate a list of training jobs

Generates a list of model training jobs that spans the hyperparameter search space. This function filters out jobs that are redundant by calling filter_jobs

Args:: None
Returns:: list(tuple): A list of tuples that contain assay parameters

generate_split_shortlist(retry_time=60)[source]¶

Processes a shortlist, generates splits for each dataset on the list, and uploads a new shortlist file with the split_uuids included. Generates splits for the split_combos [[0.1,0.1], [0.1,0.2],[0.2,0.2]], [random, scaffold]

Returns:: None

generate_split_shortlist_file()[source]¶

Processes a shortlist, generates splits for each dataset on the list, and uploads a new shortlist file with the split_uuids included. Generates splits for the split_combos [[0.1,0.1], [0.15,0.15], [0.1,0.2], [0.2,0.2]], [random, scaffold]

Returns:: None

get_dataset_metadata(assay_params, retry_time=60)[source]¶

Gather the required metadata for a dataset

Args:: assay_params: dataset metadata
Returns:: None

get_shortlist_df(split_uuids=False, retry_time=60)[source]¶

Get dataframe short list

Args:: split_uuids: Boolean value saying if you want just datasets returned or the split_uuids as well
Returns:: The list of dataset_keys, along with their accompanying bucket, split type, and split_uuid if split_uuids is True

return_split_uuid(dataset_key, bucket=None, splitter=None, split_combo=None, retry_time=60)[source]¶

Loads a dataset, splits it, saves it, and returns the split_uuid

Args:

dataset_key: key for dataset to split

bucket: datastore-specific user group bucket

splitter: Type of splitter to use to split the dataset

split_combo: tuple of form (split_valid_frac, split_test_frac)

Returns:

None

return_split_uuid_file(dataset_key, response_cols, bucket=None, splitter=None, split_combo=None, retry_time=60)[source]¶

Loads a dataset, splits it, saves it, and returns the split_uuid.

Args:

dataset_key: key for dataset to split

bucket: datastore-specific user group bucket

splitter: Type of splitter to use to split the dataset

split_combo: tuple of form (split_valid_frac, split_test_frac)

Returns:

None

run_search()[source]¶

The driver code for generating hyperparameter combinations and submitting jobs

Returns:: None

split_and_save_dataset(assay_params)[source]¶

Splits a given dataset, saves it, and sets the split_uuid in the metadata

Args:: assay_params: Dataset metadata
Returns:: None

submit_jobs(job_list, retry_time=60)[source]¶

Reformats parameters as necessary and then calls run_command in a loop to submit a job for each param combo

Returns:: None

class utils.hyperparam_search_wrapper.OptunaSearch(params)[source]¶

Bases: object

Perform hyperparameter search with Bayesian Optimization (Tree Parzen Estimator) using Optuna.

To use OptunaSearch, modify the config json file as follows:

search_type: use “hyperopt”

result_dir: use two directories (recommended), separated by comma, 1st one will be used to save the best model tarball, 2nd one will be used to store all models during the process. e.g. “result_dir”: “/path/of/the/final/dir,/path/of/the/temp/dir”

model_type: RF or NN, also add max number of HyperOptSearch evaluations, e.g. “model_type”: “RF|100”. If no max number provide, the default 100 will be used. #For NN models only

lr: specify learning rate searching method and related parameters as the following scheme.
method|parameter1,parameter2…

method: supported searching schemes include: choice, uniform, loguniform, and uniformint. See Optuna documentation for details.

parameters:

choice: all values to search from, separated by comma, e.g. choice|0.0001,0.0005,0.0002,0.001

uniform: low and high bound of the interval to serach, e.g. uniform|0.00001,0.001

loguniform: low and high bound (in natural log) of the interval to serach, e.g. loguniform|-13.8,-6.9

uniformint: low and high bound of the interval to serach, e.g. uniformint|8,256

ls: similar as learning_rate, specify number of layers and size of each one.

method|num_layers|parameter1,parameter2… e.g. choice|2|8,16,32,64,128,256,512 #this will generate a two-layer config, each layer takes size from the list “8,16,32,64,128,256,512” e.g. uniformint|3|8,512 #this will generate a three-layer config, each layer takes size from the uniform interval [8,512]

dp: similar as layer_sizes, just make sure dropouts and layer_sizes should have the same number of layers.

e.g. uniform|3|0,0.4 #this will generate a three-layer config, each layer takes size from the uniform interval [0,0.4]

#For RF models only rfe: rf_estimator, same structure as the learning rate above, e.g. uniformint|64,512 #take integer values from a uniform interval [64,512]

rfd: rf_max_depth, e.g. uniformint|8,256

rff: rf_max_feature, e.g. uniformint|8,128

run_search()[source]¶

class utils.hyperparam_search_wrapper.RandomSearch(params)[source]¶

Bases: HyperparameterSearch

Generates the specified number of random parameter values for within the specified range

generate_assay_list()[source]¶

Generates the list of datasets to build models for, with their key, bucket, split, and split uuid

Returns:: None

generate_combo(params_dict)[source]¶

Method to generate all combinations from a given set of key-value pairs

Args:: params_dict: Set of key-value pairs with the key being the param name and the value being the list of values you want to try for that param
Returns:: new_dict: The list of all combinations of parameters

generate_param_combos()[source]¶

Performs additional parsing of parameters and generates all combinations

Returns:: None

split_and_save_dataset(assay_params)[source]¶

Splits a given dataset, saves it, and sets the split_uuid in the metadata

Args:: assay_params: Dataset metadata
Returns:: None

class utils.hyperparam_search_wrapper.UserSpecifiedSearch(params)[source]¶

Bases: HyperparameterSearch

Generates combinations using the user-specified steps

generate_assay_list()[source]¶

Generates the list of datasets to build models for, with their key, bucket, split, and split uuid

Returns:: None

generate_combo(params_dict)[source]¶

Method to generate all combinations from a given set of key-value pairs

Args:: params_dict: Set of key-value pairs with the key being the param name and the value being the list of values you want to try for that param
Returns:: new_dict: The list of all combinations of parameters

generate_param_combos()[source]¶

Performs additional parsing of parameters and generates all combinations

Returns:: None

split_and_save_dataset(assay_params)[source]¶

Splits a given dataset, saves it, and sets the split_uuid in the metadata

Args:: assay_params: Dataset metadata
Returns:: None

utils.hyperparam_search_wrapper.build_optuna_suggest(trial, label, method, param_list)[source]¶

Sample a hyperparameter value from an Optuna trial using the specified distribution.

Translates the same method|params format used across AMPL config files into the corresponding trial.suggest_* call. This function is used by the OptunaSearch class and is not intended for standalone usage.

Note: loguniform param_list values must be in natural-log scale (matching the legacy hyperopt convention), e.g. [-13.8, -6.9]. They are converted to actual-scale bounds via math.exp() before being passed to Optuna.

Args:

trial: An optuna.trial.Trial object. label (str): Name of the hyperparameter (used as the Optuna parameter name). method (str): Sampling method — one of 'choice', 'uniform',

'loguniform', or 'uniformint'.

param_list (list): Parameters for the chosen method:

choice: list of candidate values.
uniform/loguniform/uniformint: [low, high].

Returns:

The sampled value.

Raises:

ValueError: If method is not one of the supported options.

utils.hyperparam_search_wrapper.build_search(params)[source]¶

Builds HyperparamterSearch object

Looks at params.search_type and builds a HyperparamSearch object of the correct flavor. Will exit if the search_type is not recognized.

Args:

params (Namespace): Namespace returned by: atomsci.ddm.pipeline.parameter_parser.wrapper()

Returns:

HyperparameterSearch

utils.hyperparam_search_wrapper.gen_maestro_command(python_path, script_dir, params)[source]¶

Generates a string that can be fed into a command line.

Side Effects:

Dataset key will be converted to an absolute path before: returned. It’s difficult to predict the working directory used when maestro runs the script.

Args:

shell_script: Name of shell script to run

python_path: Path to python version

script_dir: Directory where script lives

params: parameters in dictionary format

Returns:

str: Formatted command in the form of a string

utils.hyperparam_search_wrapper.get_num_params(combo)[source]¶

Calculates the number of parameters in a fully-connected neural networ

Args:: combo: Model parameters
Returns:: tmp_sum: Calculated number of parameters

utils.hyperparam_search_wrapper.main()[source]¶

Entry point when script is run

Args:: None
Returns:: None

utils.hyperparam_search_wrapper.parse_params(param_list)[source]¶

Parse paramters

Parses parameters using parameter_parser.wrapper and filters out unnecessary parameters. Returns what an argparse.Namespace

Args:: *any_arg: any single input of a str, dict, argparse.Namespace, or list
Returns:: argparse.Namespace

utils.hyperparam_search_wrapper.permutate_NNlayer_combo_params(layer_nums, node_nums, dropout_list, max_final_layer_size)[source]¶

Generate combos of layer_sizes(str) and dropouts(str) params from the layer_nums (list), node_nums (list), dropout_list (list).

The permutation will make the NN funnel shaped, so that the next layer can only be smaller or of the same size of the current layer.

Example:: permutate_NNlayer_combo_params([2], [4,8,16], [0], 16) returns [[16, 4], [16, 8], [8,4]] [[0,0],[0,0],[0,0]]

If there are duplicates of the same size, it will create consecutive layers of the same size.

Example:

permutate_NNlayer_combo_params([2], [4,8,8], [0], 16) returns [[8, 8], [8, 4]] [[0,0],[0,0]]

Args:

layer_nums: specify numbers of layers.

node_nums: specify numbers of nodes per layer.

dropout_list: specify the dropouts.

max_last_layer_size: sets the max size of the last layer. It will be set to the smallest node_num if needed.

Returns:

layer_sizes, dropouts: the layer sizes and dropouts generated based on the input parameters

utils.hyperparam_search_wrapper.reformat_filter_dict(filter_dict)[source]¶

Function to reformat a filter dictionary to match the Model Tracker metadata structure. Updated 9/2020 by A. Paulson for new LC model tracker.

Args:: filter_dict: Dictionary containing metadata for model of interest
Returns:: new_filter_dict: Filter dict reformatted

utils.hyperparam_search_wrapper.run_cmd(cmd)[source]¶

Function to submit a job using subprocess

Args:: cmd: Command to run
Returns:: output: Output of command

utils.hyperparam_search_wrapper.run_command(shell_script, python_path, script_dir, params)[source]¶

Function to submit jobs on a slurm system

Args:

shell_script: Name of shell script to run

python_path: Path to python version

script_dir: Directory where script lives

params: parameters in dictionary format

Returns:

None

utils.many_to_one module¶

exception utils.many_to_one.ManyToOneException[source]¶: Bases: Exception

exception utils.many_to_one.NANCompoundIDException[source]¶: Bases: Exception

exception utils.many_to_one.NANSMILESException[source]¶: Bases: Exception

utils.many_to_one.has_nans(df, col)[source]¶

utils.many_to_one.many_to_one(fn, smiles_col, id_col)[source]¶

utils.many_to_one.many_to_one_df(df, smiles_col, id_col)[source]¶

AMPL requires that SMILES and compound_ids have a many to one mapping. This function opens the dataset and checks this restraint. It will also check if any SMILES or compound_ids are empty/nan

Arguments:

df (pd.DataFrame): The DataFrame in question. smiles_col (str): The column containing SMILES. id_col (str): The column containing compound ids

Returns:

True if there is a many to one mapping. Raises one of 3 errors if it:

Has nan compound_ids
Has nan SMILES
Is not a many to one mapping between compound_ids and SMILES

utils.many_to_one.no_nan_ids_or_smiles(df, smiles_col, id_col)[source]¶

utils.model_file_reader module¶

class utils.model_file_reader.ModelFileReader(data_file_path)[source]¶

Bases: object

A class to encapsulate a model’s metadata that you might want read out from a folder. Like read version number, get the dataset key, split uuid etc of a model.

Attributes:

Set in __init__:: data_file_path (str): a model data file or a directory that contains the model

get_dataset_key()[source]¶: Returns: (str): model dataset key

get_descriptor_type()[source]¶: Returns: (str): model descriptor type

get_embedding_and_features()[source]¶: Returns: (bool): embedding_and_features

get_embedding_model_path()[source]¶: Returns: (str): embedding_model_path

get_embedding_model_uuid()[source]¶: Returns: (str): embedding_model_uuid

get_embedding_specific_parameters()[source]¶: Returns: (str): embedding specific parameters

get_featurizer()[source]¶: Returns: (str): model featurizer

get_id_col()[source]¶: Returns: (str): model id column

get_imputer_strategy()[source]¶: Returns: (str): imputer_strategy

get_model_info()[source]¶

Extract the model metadata (and if applicable, model metrics)

Returns:: a dictionary of the most important model parameters and metrics.

get_model_parameters()[source]¶: Returns: (str): model parameters

get_model_type()[source]¶: Returns: (str): model type

get_model_uuid()[source]¶: Returns: (str): model uuid

get_powertransformer_method()[source]¶: Returns: (str): powertransformer_method

get_powertransformer_standardize()[source]¶: Returns: (bool): powertransformer_standardize

get_random_seed()[source]¶: Returns: (int): random seed used in model training. Returns None if not found.

get_response_cols()[source]¶: Returns: (str): model response columns

get_robustscaler_quartile_range()[source]¶: Returns: (list[float]): robustscaler_quartile_range

get_robustscaler_unit_variance()[source]¶: Returns: (bool): robustscaler_unit_variance

get_robustscaler_with_centering()[source]¶: Returns: (bool): robustscaler_with_centering

get_robustscaler_with_scaling()[source]¶: Returns: (bool): robustscaler_with_scaling

get_smiles_col()[source]¶: Returns: (str): model smile columns

get_split_csv()[source]¶: Returns: (str): model split csv

get_split_strategy()[source]¶: Returns: (str): model split strategy

get_split_uuid()[source]¶: Returns: (str): model split_uuid

get_splitter()[source]¶: Returns: (str): model splitter

get_splitting_parameters()[source]¶: Returns: (str): model splitting parameters

get_training_dataset()[source]¶: Returns: (str): model training dataset

get_transformer_dataset_key_configs()[source]¶: Returns: (str): transformer_dataset_key_configs

get_transformer_specific_parameters()[source]¶: Returns: (str): embedding specific parameters

get_version()[source]¶: Returns: (str): model version

utils.model_file_reader.get_multiple_models_metadata(*args)[source]¶

A function that takes model tar.gz file(s) and extract the metadata (and if applicable, model metrics)

Args:: *args: Variable length argument list of model tar.gz file(s)
Returns:: a list of models’ most important model parameters and metrics. or an empty array if it fails to parse the input file(s).
Exception:: IOError: Problem access the file or if fails to parse the input file to an AMPL model

utils.model_file_reader.main(argv)[source]¶

utils.model_retrain module¶

utils.model_retrain.main(argv)[source]¶

utils.model_retrain.train_model(input, output, dskey='', production=False, keep_seed=False)[source]¶

Retrain a model saved in a model_metadata.json file

Args:

input (str): path to model_metadata.json file

output (str): path to output directory

dskey (str): new dataset key if file location has changed

production (bool): retrain the model using production mode

Returns:

the model pipeline object with trained model

utils.model_retrain.train_model_from_tar(input, output, dskey='', production=False, keep_seed=False)[source]¶

Retrain a model saved in a tar.gz file

Args:

input (str): path to a tar.gz file

output (str): path to output directory

dskey (str): new dataset key if file location has changed

keep_seed (bool): True to keep the same random seed.

Returns:

the model pipeline object with trained model

utils.model_retrain.train_model_from_tracker(model_uuid, output_dir, production=False, keep_seed=False)[source]¶

Retrain a model saved in the model tracker, but save it to output_dir and don’t insert it into the model tracker

Args:

model_uuid (str): model tracker model_uuid file

output_dir (str): path to output directory

Returns:

the model pipeline object with trained model

utils.model_retrain.train_models_from_dataset_keys(input, output, pred_type='regression', production=False, keep_seed=False)[source]¶

Retrain a list of models from an input file

Args:

input (str): path to an Excel or csv file. the required columns are ‘dataset_key’ and ‘bucket’ (public, private_file or Filesystem).

output (str): path to output directory

pred_type (str, optional): set the model prediction type. if not, uses the default ‘regression’

Returns:

None

utils.model_version_utils module¶

model_version_utils.py

Misc utilities to get the AMPL version(s) used to train one or more models and check them for compatibility with the currently running version of AMPL.:

To check the model version

usage: model_version_utils.py [-h] -i INPUT

optional arguments:

-h, --help

show this help message and exit

-i INPUT, --input INPUT

input directory/file (required)

utils.model_version_utils.check_version_compatible(input, ignore_check=False)[source]¶

Compare the input file’s version against the running AMPL version to see if they are compatible

Args:: filename (str): file or version number
Returns:: True if the input model version matches the compatible AMPL version group

utils.model_version_utils.get_ampl_version()[source]¶

Get the running ampl version

Returns:: the AMPL version

utils.model_version_utils.get_ampl_version_from_dir(dirname)[source]¶

Get the AMPL versions for all the models stored under the given directory and its subdirectories, recursively.

Args:: dirname (str): directory
Returns:: list of AMPL versions

utils.model_version_utils.get_ampl_version_from_json(metadata_path)[source]¶

Parse model_metadata.json to get the AMPL version

Args:: filename (str): tar file
Returns:: the AMPL version number

utils.model_version_utils.get_ampl_version_from_model(filename)[source]¶

Get the AMPL version from the tar file’s model_metadata.json

Args:: filename (str): tar file
Returns:: the AMPL version number

utils.model_version_utils.get_major_version(full_version)[source]¶

utils.model_version_utils.main(argv)[source]¶

utils.model_version_utils.validate_version(input)[source]¶

utils.pubchem_utils module¶

utils.pubchem_utils.download_SID_from_bioactivity_assay(bioassayid)[source]¶

Retrieve summary info on bioactivity assays.

Args:: a single bioactivity id: PubChem AIDs (bioactivity assay ids)
Returns:: Returns the sids tested on this assay

utils.pubchem_utils.download_activitytype(aid, sid)[source]¶

Retrieve data for assays for a select list of sids.

Args:

myList (list): a bioactivity id (aid)

sidlst (list): list of sids specified as integers

Returns:

Nothing returned yet, will return basic stats to help decide whether to use assay or not

utils.pubchem_utils.download_bioactivity_assay(myList, intv=1)[source]¶

Retrieve summary info on bioactivity assays.

Args:

myList (list): List of PubChem AIDs (bioactivity assay ids)

intv (1): number of INCHIKEYS to submit queries for in one request, default is 1

Returns:

Nothing returned yet, will return basic stats to help decide whether to use assay or not

utils.pubchem_utils.download_dose_response_from_bioactivity(aid, sidlst)[source]¶

Retrieve data for assays for a select list of sids.

Args:

myList (list): a bioactivity id (aid)

sidlst (list): list of sids specified as integers

Returns:

Nothing returned yet, will return basic stats to help decide whether to use assay or not

utils.pubchem_utils.download_smiles(myList, intv=1)[source]¶

Retrieve canonical SMILES strings for a list of input INCHIKEYS. Will return only one SMILES string per INCHIKEY. If there are multiple values returned, the first is retained and the others are returned in a the discard_lst. INCHIKEYS that fail to return a SMILES string are put in the fail_lst

Args:

myList (list): List of INCHIKEYS

intv (1): number of INCHIKEYS to submit queries for in one request, default is 1

Returns:

list of SMILES strings corresponding to INCHIKEYS

list of INCHIKEYS, which failed to return a SMILES string

list of CIDs and SMILES, which were returned beyond the first CID and SMILE found for input INCHIKEY

utils.rdkit_easy module¶

Utilities for clustering and visualizing compound structures using RDKit.

utils.rdkit_easy.add_mol_column(df, smiles_col, molecule_col='mol')[source]¶

Converts SMILES strings in a data frame to RDKit Mol objects and adds them as a new column in the data frame.

Args:

df (pd.DataFrame): Data frame to add column to.

smiles_col (str): Column containing SMILES strings.

molecule_col (str): Name of column to create to hold Mol objects.

Returns:

pd.DataFrame: Modified data frame.

utils.rdkit_easy.calculate_descriptors(df, molecule_column='mol')[source]¶

Uses RDKit to compute various descriptors for compounds specified by Mol objects in the given data frame.

Args:

df (pd.DataFrame): Data frame containing molecules.

molecule_column (str): Name of column containing Mol objects for compounds.

Returns:

pd.DataFrame: Modified data frame with added columns for the descriptors.

utils.rdkit_easy.cluster_dataframe(df, molecule_column='mol', cluster_column='cluster', cutoff=0.2)[source]¶

Performs Butina clustering on compounds specified by Mol objects in a data frame.

Modifies the input dataframe to add a column ‘cluster_column’ containing the cluster index for each molecule.

From RDKit cookbook http://rdkit.org/docs_temp/Cookbook.html.

Args:

df (pd.DataFrame): Data frame containing compounds to cluster.

molecule_column (str): Name of column containing rdkit Mol objects for compounds.

cluster_column (str): Column that will be created to hold cluster indices.

cutoff (float): Maximum Tanimoto distance parameter used by Butina algorithm to identify neighbors of each molecule.

Returns:

None. Input data frame will be modified in place.

utils.rdkit_easy.cluster_fingerprints(fps, cutoff=0.2)[source]¶

Performs Butina clustering on compounds specified by a list of fingerprint bit vectors.

From RDKit cookbook http://rdkit.org/docs_temp/Cookbook.html.

Args:

fps (list of rdkit.ExplicitBitVect): List of fingerprint bit vectors.

cutoff (float): Cutoff distance parameter used to seed clusters in Butina algorithm.

Returns:

tuple of tuple: Indices of fingerprints assigned to each cluster.

utils.rdkit_easy.compute_drug_likeness(df, molecule_column='mol')[source]¶

Compute various molecular descriptors and drug-likeness criteria for compounds specified by RDKit Mol objects. The descriptors are added to the input data frame, and are limited to those used to compute the Lipinski rule-of-five, Ghose and Veber drug-likeness filters. The QED (qualitative estimate of drug-likeness) score is also added to the data frame, along with columns of booleans indicating whether the various sets of filter criteria are met.

Args:

df (pandas.DataFrame): Input DataFrame containing RDKit Mol objects. molecule_column (str): Name of the column in the DataFrame that contains the RDKit Mol objects. Default is ‘mol’.

Returns:

pandas.DataFrame: A copy of the input DataFrame with additional columns for the computed descriptors:

MolWt: Molecular weight
LogP: Logarithm of the partition coefficient between n-octanol and water
NumHDonors: Number of hydrogen bond donors
NumHAcceptors: Number of hydrogen bond acceptors
TPSA: Topological polar surface area
NumRotatableBonds: Number of rotatable bonds
MolarRefractivity: Molar refractivity
QED: Quantitative estimate of drug-likeness
TotalAtoms: Total number of atoms
Lipinski: Boolean indicating if the molecule meets Lipinski’s rule of five criteria
Ghose: Boolean indicating if the molecule meets Ghose filter criteria
Veber: Boolean indicating if the molecule meets Veber’s rule criteria

utils.rdkit_easy.matching_atoms_and_bonds(mol, match_mol)[source]¶

Returns lists of indices of atoms and bonds within molecule mol that are part of the substructure matched by match_mol.

Args:

mol (rdkit.Chem.Mol): Object representing molecule.

match_mol (rdkit.Chem.Mol): Object representing a substructure or SMARTS pattern to be compared against mol, typically created by Chem.MolFromSmiles() or Chem.MolFromSmarts().

Returns:

match_atoms, match_bonds (tuple(list(int), list(int))): Lists of indices of atoms and bonds within mol contained in the substructure (if any) matched by match_mol. Returns empty lists if there is no match.

utils.rdkit_easy.mol_to_html(mol, highlight=None, name='', type='svg', directory='rdkit_svg', embed=False, width=400, height=200)[source]¶

Creates an image displaying the given molecule’s 2D structure, and generates an HTML tag for it. The image can be embedded directly into the HTML tag or saved to a file.

Args:

mol (rdkit.Chem.Mol): Object representing molecule.

highlight (rdkit.Chem.Mol): Optional object representing a set of atoms and bonds to be highlighted in the image.

name (str): Filename of image file to create, relative to ‘directory’; only used if embed=False.

type (str): Image format; must be ‘png’ or ‘svg’.

directory (str): Path relative to notebook directory of subdirectory where image file will be written. The directory will be created if necessary. Note that absolute paths will not work in notebooks. Ignored if embed=True.

embed (bool): If True, image data will be embedded in the generated HTML tag. Otherwise it will be written to a file determined by the directory and name arguments.

width (int): Width of image bounding box.

height (int): Height of image bounding box.

Returns:

str: HTML image tag referencing the image file.

utils.rdkit_easy.mol_to_pil(mol, size=(400, 200), highlight=None)[source]¶

Returns a Python Image Library (PIL) object containing an image of the given molecule’s structure.

Args:

mol (rdkit.Chem.Mol): Object representing molecule.

size (tuple): Width and height of bounding box of image.

highlight (rdkit.Chem.Mol): Object representing substructure to highlight on molecule.

Returns:

PIL.PngImageFile: An object containing an image of the molecule’s structure.

utils.rdkit_easy.mol_to_svg(mol, size=(400, 200), highlight=None)[source]¶

Returns a RDKit MolDraw2DSVG object containing an image of the given molecule’s structure.

Args:

mol (rdkit.Chem.Mol): Object representing molecule.

size (tuple): Width and height of bounding box of image.

highlight (rdkit.Chem.Mol): Object representing substructure to highlight on molecule.

Returns:

RDKit.rdMolDraw2D.MolDraw2DSVG text (str): An SVG object containing an image of the molecule’s structure.

utils.rdkit_easy.save_png(mol, name, size=(400, 200), highlight=None)[source]¶

Draws the molecule mol into a PNG file with filename ‘name’ and with the given size in pixels.

Args:

mol (rdkit.Chem.Mol): Object representing molecule.

name (str): Path to write PNG file to.

size (tuple): Width and height of bounding box of image.

highlight (rdkit.Chem.Mol): Object representing substructure to highlight on molecule.

utils.rdkit_easy.save_svg(mol, name, size=(400, 200), highlight=None)[source]¶

Draws the molecule mol into an SVG file with filename ‘name’ and with the given size in pixels.

Args:

mol (rdkit.Chem.Mol): Object representing molecule.

name (str): Path to write SVG file to.

size (tuple): Width and height of bounding box of image.

highlight (rdkit.Chem.Mol): Object representing substructure to highlight on molecule.

utils.rdkit_easy.setup_notebook()[source]¶: Set up current notebook for displaying plots and Bokeh output using full width of window

utils.rdkit_easy.show_df(df)[source]¶

Convenience function to display a pandas DataFrame in the current notebook window with HTML images rendered in table cells.

Args:: df (pd.DataFrame): Data frame to display.
Returns:: None

utils.rdkit_easy.show_html(html)[source]¶

Convenience function to display an HTML image specified by image tag ‘html’.

Args:: html (str): HTML image tag to render.
Returns:: None

utils.split_response_dist_plots module¶

Module to plot distributions of response values in each subset of a dataset generated by a split

utils.split_response_dist_plots.compute_split_subset_wasserstein_distances(params)[source]¶

Compute the Wasserstein (“earth-moving”) distance between the distributions of the response variable(s) in the validation and test sets to that of the training set. In the case of a k-fold CV split, compare the distributions of folds 1 through k-1 and the test set to that of fold 0.

Args:: params (argparse.Namespace or dict): Structure containing dataset and split parameters. The following parameters are required, if not set to default values:

- dataset_key

- split_uuid

- split_strategy

- splitter

- split_valid_frac

- split_test_frac

- num_folds

- smiles_col

- response_cols
Returns:: (DataFrame): A table of Wasserstein distances relative to the training set or fold 0 for each other split subset, for each response variable.

utils.split_response_dist_plots.get_split_labeled_dataset(params)[source]¶

Add a column to a dataset labeling the split subset for each row. Given a dataset and split parameters (including split_uuid) referenced in params, returns a data frame containing the dataset with an extra ‘split_subset’ column indicating the subset each data point belongs to. For standard 3-way splits, the labels will be ‘train’, ‘valid’ and ‘test’. For a k-fold CV split, the labels will be ‘fold_0’ through ‘fold_<k-1>’ and ‘test’.

Args:: params (argparse.Namespace or dict): Structure containing dataset and split parameters. The following parameters are required, if not set to default values:

- dataset_key

- split_uuid

- split_strategy

- splitter

- split_valid_frac

- split_test_frac

- num_folds

- smiles_col

- response_cols
Returns:: A tuple (dset_df, split_label):

- dset_df (DataFrame): The dataset specified by params.dataset_key, with additional column split_subset.

- split_label (str): A short description of the split, useful for plot labeling.

utils.split_response_dist_plots.plot_split_subset_response_distrs(params, axes=None, plot_size=7)[source]¶

Plot the distributions of the response variable(s) in each split subset of a dataset.

Args:

params (argparse.Namespace or dict): Structure containing dataset and split parameters. The following parameters are required, if not set to default values:

- dataset_key
- split_uuid
- split_strategy
- splitter
- split_valid_frac
- split_test_frac
- num_folds
- smiles_col
- response_cols

axes (matplotlib.Axes): Axes to draw plots in, if provided plot_size (float): Height of plots; ignored if axes is provided

Returns:

None

utils.struct_utils module¶

Functions to manipulate and convert between various representations of chemical structures: SMILES, InChi and RDKit Mol objects. Many of these functions (those with a ‘workers’ argument) accept either a single SMILES or InChi string or a list of strings as their first argument, and return a value with the same datatype. If a list is passed and the ‘workers’ argument is > 1, the calculation is parallelized across multiple threads; this can save significant time when operating on thousands of molecules.

utils.struct_utils.base_mol_from_inchi(inchi_str, useIsomericSmiles=True, removeCharges=False)[source]¶

Generate a standardized RDKit Mol object for the largest fragment of the molecule specified by InChi string inchi_str. Replace any rare isotopes with the most common ones for each element. If removeCharges is True, add hydrogens as needed to eliminate charges.

Args:

inchi_str (str): InChi string representing molecule.

useIsomericSmiles (bool): Whether to retain stereochemistry information in the generated string.

removeCharges (bool): If true, add or remove hydrogens to produce uncharged molecules.

Returns:

str: Standardized salt-stripped SMILES string.

utils.struct_utils.base_mol_from_smiles(orig_smiles, useIsomericSmiles=True, removeCharges=False)[source]¶

Generate a standardized RDKit Mol object for the largest fragment of the molecule specified by orig_smiles. Replace any rare isotopes with the most common ones for each element. If removeCharges is True, add hydrogens as needed to eliminate charges.

Args:

orig_smiles (str): SMILES string to standardize.

useIsomericSmiles (bool): Whether to retain stereochemistry information in the generated string.

removeCharges (bool): If true, add or remove hydrogens to produce uncharged molecules.

Returns:

str: Standardized salt-stripped SMILES string.

utils.struct_utils.base_smiles_from_inchi(inchi_str, useIsomericSmiles=True, removeCharges=False, workers=1)[source]¶

Generate standardized salt-stripped SMILES strings for the largest fragments of each molecule represented by InChi string(s) inchi_str. Replaces any rare isotopes with the most common ones for each element.

Args:

inchi_str (list or str): List of InChi strings to convert.

useIsomericSmiles (bool): Whether to retain stereochemistry information in the generated strings.

removeCharges (bool): If true, add or remove hydrogens to produce uncharged molecules.

workers (int): Number of parallel threads to use for calculation.

Returns:

list or str: Standardized SMILES strings.

utils.struct_utils.base_smiles_from_smiles(orig_smiles, useIsomericSmiles=True, removeCharges=False, useCanonicalTautomers=False, workers=1)[source]¶

Generate standardized SMILES strings for the largest fragments of each molecule specified by orig_smiles. Strips salt groups and replaces any rare isotopes with the most common ones for each element.

Args:

orig_smiles (list or str): List of SMILES strings to canonicalize.

useIsomericSmiles (bool): Whether to retain stereochemistry information in the generated strings.

removeCharges (bool): If true, add or remove hydrogens to produce uncharged molecules.

useCanonicalTautomers (bool): Whether to convert the generated SMILES to their canonical tautomers. Defaults to False for backward compatibility.

workers (int): Number of parallel threads to use for calculation.

Returns:

list or str: Canonicalized SMILES strings.

utils.struct_utils.canonical_tautomers_from_smiles(smiles)[source]¶

Returns SMILES strings for the canonical tautomers of a SMILES string or list of SMILES strings

Args:: smiles (list or str): List of SMILES strings.
Returns:: (list of str) : List of SMILES strings for the canonical tautomers.

utils.struct_utils.draw_structure(smiles_str, image_path, image_size=500)[source]¶

Draw structure for the compound with the given SMILES string as a PNG file.

Note that there are more flexible functions for drawing structures in the rdkit_easy module. This function is only retained for backward compatibility.

Args:

smiles_str (str): SMILES representation of compound.

image_path (str): Filepath for image file to be generated.

image_size (int): Width of square bounding box for image.

Returns:

None.

utils.struct_utils.fix_moe_smiles(smiles)[source]¶

Correct the SMILES strings generated by MOE to standardize the representation of protonated atoms, so that RDKit can read them.

Args:: smiles (str): SMILES string.
Returns:: str: The corrected SMILES string.

utils.struct_utils.get_rdkit_smiles(orig_smiles, useIsomericSmiles=True)[source]¶

Given a SMILES string, regenerate a “canonical” SMILES string for the same molecule using the implementation in RDKit.

Args:

orig_smiles (str): SMILES string to canonicalize.

useIsomericSmiles (bool): Whether to retain stereochemistry information in the generated string.

Returns:

str: Canonicalized SMILES string.

utils.struct_utils.kekulize_smiles(orig_smiles, useIsomericSmiles=True, workers=1)[source]¶

Generate Kekulized SMILES strings for the molecules specified by orig_smiles. Kekulized SMILES strings are ones in which aromatic rings are represented by uppercase letters with alternating single and double bonds, rather than lowercase letters; they are needed by some external applications.

Args:

orig_smiles (list or str): List of SMILES strings to Kekulize.

useIsomericSmiles (bool): Whether to retain stereochemistry information in the generated strings.

workers (int): Number of parallel threads to use for calculation.

Returns:

list or str: Kekulized SMILES strings.

utils.struct_utils.mol_wt_from_smiles(smiles, workers=1)[source]¶

Calculate molecular weights for molecules represented by SMILES strings.

Args:

smiles (list or str): List of SMILES strings.

workers (int): Number of parallel threads to use for calculations.

Returns:

list or float: Molecular weights. NaN is returned for SMILES strings that could not be read by RDKit.

utils.struct_utils.mols_from_smiles(orig_smiles, workers=1)[source]¶

Parallel function to create RDKit Mol objects for a list of SMILES strings. If orig_smiles is a list and workers is > 1, spawn ‘workers’ processes to convert input SMILES strings to Mol objects.

Args:

orig_smiles (list or str): List of SMILES strings to convert to Mol objects.

workers (int): Number of parallel threads to use for calculation.

Returns:

list of rdkit.Chem.Mol: RDKit objects representing molecules.

utils.struct_utils.rdkit_smiles_from_smiles(orig_smiles, useIsomericSmiles=True, useCanonicalTautomers=False, workers=1)[source]¶

Parallel version of get_rdkit_smiles. If orig_smiles is a list and workers is > 1, spawn ‘workers’ processes to convert input SMILES strings to standardized RDKit format.

Args:

orig_smiles (list or str): List of SMILES strings to canonicalize.

useIsomericSmiles (bool): Whether to retain stereochemistry information in the generated strings.

useCanonicalTautomers (bool): Whether to convert the generated SMILES to their canonical tautomers. Defaults to False for backward compatibility.

workers (int): Number of parallel threads to use for calculation.

Returns:

list or str: Canonicalized SMILES strings.

utils.struct_utils.smiles_to_inchi_key(smiles)[source]¶

Generates an InChI key from a SMILES string. Note that an InChI key is different from an InChI string; it can be used as a unique identifier, but doesn’t hold the information needed to reconstruct a molecule.

Args:: smiles (str): SMILES string.
Returns:: str: An InChI key. Returns None if RDKit cannot convert the SMILES string to an RDKit Mol object.

utils package¶

Submodules¶

utils.compare_splits_plots module¶

utils.curate_data module¶

utils.data_curation_functions module¶

utils.datastore_functions module¶

utils.generate_transformers module¶

utils.hyperparam_search_wrapper module¶

utils.many_to_one module¶

utils.model_file_reader module¶

utils.model_retrain module¶

utils.model_version_utils module¶

utils.pubchem_utils module¶

utils.rdkit_easy module¶

utils.split_response_dist_plots module¶

utils.struct_utils module¶

Module contents¶