utils package

Submodules

utils.compare_splits_plots module

class utils.compare_splits_plots.SplitStats(total_df, split_df, smiles_col, id_col, response_cols)[source]

Bases: object

This object manages a dataset and a given split dataframe.

dist_hist_plot(dists, title, dist_path='')[source]

Creates a histogram of pairwise Tanimoto distances between training and test sets

Args:
dist_path (str): Optional; where to save the plot. The string ‘_dist_hist’ will be appended to this input.

dist_hist_train_v_test_plot(ax=None)[source]

Plots Tanimoto distances between training and test subsets

Returns:

g (Seaborn FacetGrid): FacetGrid object from seaborn

dist_hist_train_v_valid_plot(ax=None)[source]

Plots Tanimoto distances between training and valid subsets

Returns:

g (Seaborn FacetGrid): FacetGrid object from seaborn

make_all_plots(dist_path='')[source]

Makes a series of diagnostic plots

Args:
dist_path (str): Optional; where to save the plots. The string ‘_frac_box’ will be appended to this input.

print_stats()[source]

Prints useful statistics to stdout

subset_frac_plot(dist_path='')[source]

Makes a box plot of the subset fractions

Args:
dist_path (str): Optional; where to save the plot. The string ‘_frac_box’ will be appended to this input.

umap_plot(dist_path='')[source]

Plots the first 10,000 samples in UMAP space using Morgan fingerprints

Args:
dist_path (str): Optional; where to save the plot. The string ‘_umap_scatter’ will be appended to this input.

utils.compare_splits_plots.parse_args()[source]
utils.compare_splits_plots.save_figure(filename)[source]

Saves a figure to disk. Saves both png and svg formats.

Args:

filename (str): The name of the figure.

utils.compare_splits_plots.split(total_df, split_df, id_col)[source]

Splits a dataset into training, test and validation sets using a given split.

Args:

total_df (DataFrame): A pandas dataframe.

split_df (DataFrame): A split dataframe containing ‘cmpd_id’ and ‘subset’ columns.

id_col (str): The ID column in total_df.

Returns:
(DataFrame, DataFrame, DataFrame): Three dataframes for train, test, and valid, respectively.
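
A minimal usage sketch for SplitStats and its plotting methods, assuming the module is importable under the atomsci.ddm.utils package prefix used elsewhere in this documentation; the file paths and column names below are hypothetical:

    import pandas as pd
    from atomsci.ddm.utils.compare_splits_plots import SplitStats

    # Hypothetical inputs: the full dataset and the split table generated for it
    total_df = pd.read_csv('my_dataset.csv')        # contains SMILES, ID and response columns
    split_df = pd.read_csv('my_dataset_split.csv')  # contains 'cmpd_id' and 'subset' columns

    stats = SplitStats(total_df, split_df, smiles_col='rdkit_smiles',
                       id_col='compound_id', response_cols=['pIC50'])
    stats.print_stats()                         # print summary statistics to stdout
    stats.make_all_plots(dist_path='my_split')  # save the diagnostic plots under this prefix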

utils.curate_data module

Utility functions used for AMPL dataset curation and creation.

utils.curate_data.add_classification_column(thresholds, value_column, label_column, data, right_inclusive=True)[source]

Add a classification column to a DataFrame.

Add a classification column ‘label_column’ to DataFrame ‘data’ based on values in ‘value_column’, according to a sequence of thresholds. The number of classes is one plus the number of thresholds.

Args:
thresholds (float or sequence of floats): Thresholds used to assign class labels. Label i will be assigned to values such that thresholds[i-1] < value <= thresholds[i] (if right_inclusive is True) or thresholds[i-1] <= value < thresholds[i] (otherwise).

value_column (str): Name of the column from which class labels are derived.

label_column (str): Name of the new column to be created for class labels.

data (DataFrame): DataFrame holding all data.

right_inclusive (bool): Whether the thresholding intervals are closed on the right or on the left. Set this to False to get the same behavior as add_binary_tertiary_classification. The default behavior is preferred for the common case where the classification is based on a left-censoring threshold.

Returns:

DataFrame: DataFrame updated to include class label column.
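
A minimal sketch of the thresholding behavior (the atomsci.ddm.utils.curate_data path follows the references elsewhere in this documentation; the threshold and column names are illustrative):

    import pandas as pd
    from atomsci.ddm.utils.curate_data import add_classification_column

    df = pd.DataFrame({'pIC50': [4.2, 5.9, 6.0, 7.5]})
    # One threshold gives two classes: with right_inclusive=True, values <= 6.0
    # get label 0 and values > 6.0 get label 1.
    df = add_classification_column(6.0, 'pIC50', 'active', df, right_inclusive=True)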

utils.curate_data.aggregate_assay_data(assay_df, value_col='VALUE_NUM', output_value_col=None, label_actives=True, active_thresh=None, id_col='CMPD_NUMBER', smiles_col='rdkit_smiles', relation_col='VALUE_FLAG', date_col=None, verbose=False)[source]

Aggregates replicated values in assay data

Map RDKit SMILES strings in assay_df to base structures, then compute an MLE estimate of the mean value over replicate measurements for the same SMILES strings, taking censoring into account. Generate an aggregated result table with one value for each unique base SMILES string, to be used in an ML-ready dataset.

Args:

assay_df (DataFrame): The input DataFrame to be processed.

value_col (str): The column in the DataFrame containing assay values to be averaged.

output_value_col (str): Optional; the column name to use in the output DataFrame for the averaged data.

label_actives (bool): If True, generate an additional column ‘active’ indicating whether the mean value is above a threshold specified by active_thresh.

active_thresh (float): The threshold to be used for labeling compounds as active or inactive. If active_thresh is None (the default), the threshold used is the minimum reported value across all records with left-censored values (i.e., those with ‘<’ in the relation column).

id_col (str): The input DataFrame column containing compound IDs.

smiles_col (str): The input DataFrame column containing SMILES strings.

relation_col (str): The input DataFrame column containing relational operators (<, >, etc.).

date_col (str): The input DataFrame column containing dates when the assay data was uploaded. If not None, the code will assign the earliest date among replicates to the aggregated data record.

Returns:

A DataFrame containing averaged assay values, with one value per compound.
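
A minimal sketch with a toy replicate table; the column values are illustrative and only the non-default column names are passed explicitly:

    import pandas as pd
    from atomsci.ddm.utils.curate_data import aggregate_assay_data

    raw_df = pd.DataFrame({
        'compound_id': ['C1', 'C1', 'C2'],
        'rdkit_smiles': ['CCO', 'CCO', 'c1ccccc1'],
        'PIC50': [6.1, 6.3, 5.0],
        'relation': ['', '', '<'],   # '<' marks a left-censored measurement
    })
    agg_df = aggregate_assay_data(raw_df, value_col='PIC50', id_col='compound_id',
                                  smiles_col='rdkit_smiles', relation_col='relation')
    # agg_df contains one MLE-averaged value per unique base SMILES string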

utils.curate_data.average_and_remove_duplicates(column, tolerance, list_bad_duplicates, data, max_stdev=100000, compound_id='CMPD_NUMBER', rm_duplicate_only=False, smiles_col='rdkit_smiles_parent')[source]

Removes ‘bad duplicates’ iteratively until none remain.

This function removes duplicates based on max_stdev and tolerance. If a value in data[column] falls too far from the mean, as determined by tolerance and max_stdev, that entry is removed. This is repeated until all bad entries have been removed.

Args:

column (str): column with the value of interest

tolerance (float): Acceptable percent difference between a value and the average; i.e., if (value - mean)/mean*100 > tolerance, the data row is removed.

list_bad_duplicates (str): ‘Yes’ to list the bad duplicates

data (DataFrame): input DataFrame

max_stdev (float): maximum standard deviation threshold

compound_id (str): column containing compound ids

rm_duplicate_only (bool): Only remove bad duplicates; don’t average the good ones. The resulting table can be fed into aggregate_assay_data for further processing.

note: The mean is recalculated on each pass to make sure it isn’t skewed by the ‘bad duplicate’ values.

smiles_col (str): column containing base rdkit smiles strings

Returns:

DataFrame: Returns remaining rows after all bad duplicates have been removed.

utils.curate_data.create_new_rows_for_extra_results(extra_result_col, value_col, data)[source]

Moves results from an extra column to an existing column

Returns a new DataFrame with values from ‘extra_result_col’ appended to the end of ‘value_col’. NaN values in ‘extra_result_col’ are dropped. ‘extra_result_col’ is dropped from the resulting DataFrame.

Args:

extra_result_col (str): A column in ‘data’.

value_col (str): A column in ‘data’.

data (DataFrame):

Returns:

DataFrame

utils.curate_data.filter_in_by_column_values(column, values, data)[source]

Include rows only for given values in specified column.

Filters in all rows in data if row[column] in values.

Args:

column (str): Name of a column in data.

values (iterable): An iterable, Series, DataFrame, or dict of values contained in data[column].

data (DataFrame): A DataFrame.

Returns:

DataFrame: DataFrame containing filtered rows.

utils.curate_data.filter_in_out_by_column_values(column, values, data, in_out)[source]

Include or exclude rows with given values in a specified column.

Given a DataFrame, a column, and an iterable, Series, DataFrame, or dict of values, return a DataFrame containing either the rows whose value in ‘column’ is in values, or all rows whose value is not in values.

Args:

column (str): Name of a column in data.

values (iterable): An iterable, Series, DataFrame, or dict of values contained in data[column].

data (DataFrame): A DataFrame.

in_out (str): If set to ‘in’, filters in rows that contain a value in values. If set to anything else, filters out rows that contain a value in values.

Returns:

DataFrame: DataFrame containing filtered rows.

utils.curate_data.filter_out_by_column_values(column, values, data)[source]

Exclude rows only for given values in specified column.

Filters out all rows in data if row[column] in values.

Args:

column (str): Name of a column in data.

values (iterable): An iterable, Series, DataFrame, or dict of values contained in data[column].

data (DataFrame): A DataFrame.

Returns:

DataFrame: DataFrame containing filtered rows.
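
A short sketch combining two of the filter functions on a toy table (hypothetical column names and values):

    import pandas as pd
    from atomsci.ddm.utils.curate_data import (
        filter_in_by_column_values, filter_out_by_column_values)

    df = pd.DataFrame({'species': ['human', 'rat', 'human'],
                       'wildtype_or_mutant': ['wildtype', 'wildtype', 'mutated']})
    df = filter_in_by_column_values('species', ['human'], df)                # keep human rows
    df = filter_out_by_column_values('wildtype_or_mutant', ['mutated'], df)  # drop mutant rows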

utils.curate_data.filter_out_comments(values, values_cs, data)[source]

Remove rows that contain the text listed

Removes any rows where data[‘COMMENTS’] contains the words in values or values_cs. Used for removing results that indicate bad data in the comments.

Args:

values (list): List of values that are not case sensitive.

values_cs (list): List of values that are case sensitive.

data (DataFrame): DataFrame containing a column named ‘COMMENTS’

Returns:

DataFrame: Returns a DataFrame with the remaining rows

utils.curate_data.freq_table(dset_df, column, min_freq=1)[source]

Generate a DataFrame tabulating the repeat frequencies of unique values.

Generate a DataFrame tabulating the repeat frequencies of each unique value in ‘column’. Restrict it to values occurring at least min_freq times.

Args:

dset_df (DataFrame): An input DataFrame

column (str): The name of one column in DataFrame

min_freq (int): Only include values that occur at least min_freq times.

Returns:
DataFrame: DataFrame containing two columns: the column passed in as the ‘column’ argument and a column named ‘Count’. The ‘Count’ column contains the number of occurrences of each value in the ‘column’ argument.
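
For example, to find compounds with replicate measurements (a sketch with a hypothetical input table):

    import pandas as pd
    from atomsci.ddm.utils.curate_data import freq_table

    dset_df = pd.DataFrame({'compound_id': ['C1', 'C1', 'C2', 'C3', 'C3', 'C3']})
    rep_df = freq_table(dset_df, 'compound_id', min_freq=2)
    # rep_df lists C1 and C3 with 'Count' values 2 and 3; C2 is excluded by min_freq=2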

utils.curate_data.get_rdkit_smiles_parent(data)[source]

Strip the salts off the rdkit SMILES strings

First, loops through data and determines the base/parent SMILES string for each row, collecting the results in a list. Then adds the list as a new column, ‘rdkit_smiles_parent’, in ‘data’. Essentially calls base_smiles_from_smiles for each SMILES string in the column ‘rdkit_smiles’.

Args:

data (DataFrame): A DataFrame with a column named ‘rdkit_smiles’.

Returns:

DataFrame: The input DataFrame with an added column ‘rdkit_smiles_parent’ containing salt-stripped SMILES strings.

utils.curate_data.labeled_freq_table(dset_df, columns, min_freq=1)[source]

Generate a frequency table in which additional columns are included.

Generate a frequency table in which additional columns are included. The first column in ‘columns’ is assumed to be a unique ID; there should be a many-to-1 mapping from the ID to each of the additional columns.

Args:

dset_df (DataFrame): The input DataFrame.

columns (list(str)): A list of columns to include in the output frequency table. The first column in ‘columns’ is assumed to be a unique ID; there should be a many-to-1 mapping from the ID to each of the additional columns.

min_freq (int): Only include values that occur at least min_freq times.

Returns:

DataFrame: A DataFrame containing a frequency table.

Raises:
Exception: If the DataFrame violates the rule that there should be a many-to-1 mapping from the ID to each of the additional columns.

utils.curate_data.mle_censored_mean(cmpd_df, std_est, value_col='PIC50', relation_col='relation')[source]

Computes maximum likelihood estimate of the true mean value for a single replicated compound.

Compute a maximum likelihood estimate of the true mean value underlying the distribution of replicate assay measurements for a single compound. The data may be a mix of censored and uncensored measurements, as indicated by the ‘relation’ column in the input DataFrame cmpd_df. std_est is an estimate for the standard deviation of the distribution, which is assumed to be Gaussian; we typically compute a common estimate for the whole dataset using replicate_rmsd().

Args:

cmpd_df (DataFrame): DataFrame containing measurements and SMILES strings.

std_est (float): An estimate for the standard deviation of the distribution.

value_col (str): Name of the column that contains target values.

relation_col (str): The input DataFrame column containing relational operators (<, >, etc.).

Returns:

float: Maximum likelihood estimate of the true mean for a replicated compound.

str: The relation: ‘’ if not censored, ‘>’ if right censored, ‘<’ if left censored.

utils.curate_data.remove_outlier_replicates(df, response_col='pIC50', id_col='compound_id', max_diff_from_median=1.0)[source]

Examine groups of replicate measurements for compounds identified by compound ID and compute median response for each group. Eliminate measurements that differ by more than a given value from the median; note that in some groups this will result in all replicates being deleted. This function should be used together with aggregate_assay_data instead of average_and_remove_duplicates to reduce data to a single value per compound.

Args:

df (DataFrame): Table of compounds and response data

response_col (str): Column containing response values

id_col (str): Column that uniquely identifies compounds, and therefore measurements to be treated as replicates.

max_diff_from_median (float): Maximum absolute difference from median value allowed for retained replicates.

Returns:

result_df (DataFrame): Filtered data frame with outlier replicates removed.
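
A sketch of the recommended combination with aggregate_assay_data, using a toy table with hypothetical values:

    import pandas as pd
    from atomsci.ddm.utils.curate_data import remove_outlier_replicates, aggregate_assay_data

    raw_df = pd.DataFrame({'compound_id': ['C1', 'C1', 'C1', 'C2'],
                           'rdkit_smiles': ['CCO', 'CCO', 'CCO', 'c1ccccc1'],
                           'PIC50': [6.0, 6.2, 8.5, 5.0],
                           'relation': ['', '', '', '']})
    # The 8.5 replicate lies more than 1 log unit from the C1 median (6.2) and is removed
    filtered_df = remove_outlier_replicates(raw_df, response_col='PIC50',
                                            id_col='compound_id', max_diff_from_median=1.0)
    agg_df = aggregate_assay_data(filtered_df, value_col='PIC50', id_col='compound_id',
                                  smiles_col='rdkit_smiles', relation_col='relation')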

utils.curate_data.replicate_rmsd(dset_df, smiles_col='base_rdkit_smiles', value_col='PIC50', relation_col='relation', default_val=1.0)[source]

Compute RMS deviation of all replicate uncensored measurements from means

Compute RMS deviation of all replicate uncensored measurements in dset_df from their means. Measurements are treated as replicates if they correspond to the same SMILES string, and are considered censored if the relation column contains > or <. The resulting value is meant to be used as an estimate of measurement error for all compounds in the dataset.

Args:

dset_df (DataFrame): DataFrame containing uncensored measurements and SMILES strings.

smiles_col (str): Name of the column that contains SMILES strings.

value_col (str): Name of the column that contains target values.

relation_col (str): The input DataFrame column containing relational operators (<, >, etc.).

default_val (float): The value to return if there are no compounds with replicate measurements.

Returns:

float: returns root mean squared deviation of all replicate uncensored measurements

utils.curate_data.set_group_permissions(path, system='AD', owner='GSK')[source]

Sets file and group permissions to standard values for a dataset containing proprietary data owned by ‘owner’. Later we may add a ‘public’ option, or groups for data from other pharma companies.

Args:

path (string): File path

system (string): Computing environment from which group ownerships will be derived; currently, either ‘LC’ for LC filesystems or ‘AD’ for LLNL systems where owners and groups are managed by Active Directory.

owner (string): Who the data belongs to, either ‘public’ or the name of a company (e.g. ‘GSK’) associated with a restricted access group.

Returns:

None

utils.curate_data.summarize_data(column, num_bins, title, units, filepath, data, log_column='No')[source]

Summarizes the data in data[column]

Summarizes the data by printing the mean, stdev, max, and min. Creates plots of the binned values in data[column]. If log_column != ‘No’, this also creates plots that compare the normal and log distributions of the data.

Args:

column (str): Column of interest.

num_bins (int): Number of bins in the histogram.

title (str): Title of the histogram.

units (str): Units for values in ‘column’.

filepath (str): This file path gets printed to the console.

data (DataFrame): Input DataFrame.

log_column (str): Defaults to ‘No’. Any other value will generate a plot comparing normal and log distributions.

Returns:

None

utils.curate_data.xc50topxc50_for_nm(x)[source]

Convert XC50 values measured in nanomolar units to their negative log molar values (pXC50)

Args:

x (float): Input XC50 value in nanomolar units.

Returns:

float: The corresponding pXC50 value, i.e. -log10 of the concentration in molar units.
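
A worked example of the conversion (assuming the function applies the nanomolar-to-molar conversion implied by its name): an XC50 of 100 nM is 1e-7 M, so its negative log is 7.0.

    from atomsci.ddm.utils.curate_data import xc50topxc50_for_nm

    pxc50 = xc50topxc50_for_nm(100.0)   # 100 nM = 1e-7 M, so -log10(1e-7) = 7.0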

utils.data_curation_functions module

data_curation_functions.py

Extract Kevin’s functions for curation of public datasets and modify them to match Jonathan’s curation methods in the notebook (01/30/2020).

utils.data_curation_functions.atom_curation(targ_lst, smiles_lst, shared_inchi_keys)[source]

Apply the ATOM standard ‘curation’ step to “shared_df”: average replicate assays, remove duplicates and drop cases with large variance between replicates (“mleqonly” curation).

Args:

targ_lst (list): A list of targets.

smiles_lst (list): A list of DataFrames.

These DataFrames must contain the columns gene_names, standard_type, standard_relation, standard_inchi_key, PIC50, and rdkit_smiles

shared_inchi_keys (list): A list of inchi keys used in this dataset.

Returns:
list, list: A list of curated DataFrames and a list of the number of compounds dropped during the curation process for each target.

utils.data_curation_functions.atom_curation_excape(targ_lst, smiles_lst, shared_inchi_keys)[source]

Apply ATOM standard ‘curation’ step: Average replicate assays, remove duplicates and drop cases with large variance between replicates. Rows with NaN values in rdkit_smiles, VALUE_NUM_mean, and pXC50 are dropped

Args:

targ_lst (list): A list of targets.

smiles_lst (list): A list of DataFrames.

These DataFrames must contain the columns gene_names, standard_type, standard_relation, standard_inchi_key, pXC50, and rdkit_smiles

shared_inchi_keys (list): A list of inchi keys used in this dataset.

Returns:

list: A list of curated DataFrames.

utils.data_curation_functions.compute_negative_log_responses(df, unit_col='unit', value_col='value', new_value_col='average_col', relation_col=None, new_relation_col=None, unit_conv={'nM': <function <lambda>>, 'uM': <function <lambda>>}, inplace=False)[source]

Given the response values in value_col (IC50, Ki, Kd, etc.), compute their negative base 10 logarithms (pIC50, pKi, pKd, etc.) after converting them to molar units, and store them in new_value_col. If relation_col is provided, replace any ‘<’ or ‘>’ relations with their opposites and store the result in new_relation_col (if provided), or in relation_col if not. Rows where the original value is 0 or negative will be dropped from the dataset.

Args:

df (DataFrame): A DataFrame that contains value_col, unit_col and relation_col.

unit_conv (dict): A dictionary mapping concentration units found in unit_col to functions that convert the corresponding concentrations to molar. The default handles micromolar and nanomolar units, represented as ‘uM’ and ‘nM’ respectively.

unit_col (str): Column containing units.

value_col (str): Column containing input values.

new_value_col (str): Column to receive converted values.

relation_col (str): Column containing relational operators for censored data.

new_relation_col (str): Column to receive inverted relations applicable to the negative log transformed values.

inplace (bool): If True, the input DataFrame is modified in place when possible. The default is to return a copy

Returns:

DataFrame: A table containing the transformed values and relations.
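
A minimal sketch, assuming the same atomsci.ddm.utils package prefix used elsewhere in this documentation; the column names below are illustrative:

    import pandas as pd
    from atomsci.ddm.utils.data_curation_functions import compute_negative_log_responses

    df = pd.DataFrame({'value': [100.0, 2.0], 'unit': ['nM', 'uM'], 'relation': ['=', '<']})
    out_df = compute_negative_log_responses(df, unit_col='unit', value_col='value',
                                            new_value_col='pIC50', relation_col='relation',
                                            new_relation_col='pIC50_relation')
    # 100 nM -> 7.0 and 2 uM -> ~5.7; the '<' relation is inverted to '>' in 'pIC50_relation'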

utils.data_curation_functions.convert_IC50_to_pIC50(df, unit_col='unit', value_col='value', new_value_col='average_col', relation_col=None, new_relation_col=None, unit_conv={'nM': <function <lambda>>, 'uM': <function <lambda>>}, inplace=False)[source]

For backward compatibility only: equivalent to calling compute_negative_log_responses with the same arguments.

utils.data_curation_functions.down_select(df, kv_lst)[source]

Filters rows given a set of values

Given a DataFrame and a list of (column, value) tuples, this function keeps only the rows where df[k] == v for every (k, v) pair.

Args:

df (DataFrame): An input DataFrame.

kv_lst (list): A list of (column, value) tuples.

Returns:

DataFrame: The rows where df[k] == v for all (k, v) pairs.

utils.data_curation_functions.exclude_organometallics(df, smiles_col='rdkit_smiles')[source]

Filters data frame df based on column smiles_col to exclude organometallic compounds

utils.data_curation_functions.filter_dtc_data(orig_df, geneNames)[source]

Extracts and post processes JAK1, 2, and 3 datasets from DTC

This is specific to the DTC database. Extracts the JAK1, JAK2 and JAK3 datasets from the Drug Target Commons database, filtered for data usability. Filter criteria:

gene_names == JAK1 | JAK2 | JAK3
InChI key not missing
standard_type IC50
units NM
standard_relation mappable to =, < or >
wildtype_or_mutant != ‘mutated’
valid SMILES that maps to a valid RDKit base SMILES
standard_value not missing
pIC50 > 3

Args:
orig_df (DataFrame): Input DataFrame. Must contain the following columns: gene_names, standard_inchi_key, standard_type, standard_units, standard_value, compound_id, wildtype_or_mutant.

geneNames (list): A list of gene names to select from orig_df, e.g. [‘JAK1’, ‘JAK2’].

Returns:

DataFrame: The filtered rows of the orig_df

utils.data_curation_functions.get_smiles_4dtc_data(nm_df, targ_lst, save_smiles_df)[source]

Returns SMILES strings from DTC data

nm_df must be a DataFrame from DTC with the following columns: gene_names, standard_type, standard_value, standard_inchi_key, and standard_relation.

This function selects all rows where nm_df[‘gene_names’] is in targ_lst, nm_df[‘standard_type’]==’IC50’, nm_df[‘standard_relation’]==’=’, and ‘standard_value’ > 0.

Then pIC50 values are calculated and added to the ‘PIC50’ column, and SMILES strings are merged in from save_smiles_df.

Args:

nm_df (DataFrame): Input DataFrame.

targ_lst (list): A list of targets.

save_smiles_df (DataFrame): A DataFrame with the column ‘standard_inchi_key’

Returns:
list, list, str: A list of SMILES strings, a list of InChI keys shared between targets, and a description of the targets.

utils.data_curation_functions.get_smiles_dtc_data(nm_df, targ_lst, save_smiles_df)[source]

Returns SMILES strings from DTC data

nm_df must be a DataFrame from DTC with the following columns: gene_names, standard_type, standard_value, standard_inchi_key, and standard_relation.

This function selects all rows where nm_df[‘gene_names’] is in targ_lst, nm_df[‘standard_type’]==’IC50’, nm_df[‘standard_relation’]==’=’, and ‘standard_value’ > 0.

Then pIC50 values are calculated and added to the ‘PIC50’ column, and SMILES strings are merged in from save_smiles_df.

Args:

nm_df (DataFrame): Input DataFrame.

targ_lst (list): A list of targets.

save_smiles_df (DataFrame): A DataFrame with the column ‘standard_inchi_key’

Returns:

list, list: A list of SMILES strings and a list of InChI keys shared between targets.

utils.data_curation_functions.get_smiles_excape_data(nm_df, targ_lst)[source]

Calculate base rdkit smiles

Divides up nm_df based on target and makes one DataFrame for each target.

Rows with NaN pXC50 values are dropped. Base RDKit SMILES are calculated from the SMILES column using atomsci.ddm.utils.struct_utils.base_rdkit_smiles_from_smiles. A new column, ‘rdkit_smiles’, is added to each output DataFrame.

Args:
nm_df (DataFrame): DataFrame for the Excape database. Should contain the columns pXC50, SMILES, and Ambit_InchiKey.

targ_lst (list): A list of targets to filter out of nm_df

Returns:
list, list: A list of DataFrames, one for each target, and a list of all InChI keys used in the dataset.

utils.data_curation_functions.ic50topic50(x)[source]

Calculates pIC50 from IC50

Args:

x (float): An IC50 in nanomolar (nM) units.

Returns:

float: The pIC50.

utils.data_curation_functions.is_organometallic(mol)[source]

Returns True if the molecule is organometallic

utils.data_curation_functions.set_data_root(dir)[source]

Set global variables for data directories

Creates paths for DTC and Excape given a root data directory, and sets the global variables ‘data_root’ and ‘data_dirs’. ‘data_root’ is the root data directory. ‘data_dirs’ is a dictionary that maps ‘DTC’ and ‘Excape’ to directories calculated from ‘data_root’.

Args:

dir (str): Root data directory containing the folders ‘dtc’ and ‘excape’.

Returns:

None

utils.data_curation_functions.standardize_relations(dset_df, db=None, rel_col=None, output_rel_col=None, invert=False)[source]

Standardizes censoring operators

Standardize the censoring operators to =, < or >, and remove any rows whose operators don’t map to a standard one. There is a special case for db=’ChEMBL’ that strips the extra quote characters around relationship symbols. Assumes the relationship columns are ‘Standard Relation’, ‘standard_relation’ and ‘activity_prefix’ for ChEMBL, DTC and GoStar respectively.

This function makes the following mappings: “>” to “>”, “>=” to “>”, “<” to “<”, “<=” to “<”, and “=” to “=”. All other relations are removed from the DataFrame.

Args:
dset_df (DataFrame): Input DataFrame. Must contain either ‘Standard Relation’ or ‘standard_relation’.

db (str): Source database. Must be either ‘GoStar’, ‘DTC’ or ‘ChEMBL’. Required if rel_col is not specified.

rel_col (str): Column containing relational operators. If specified, overrides the default relation column for db.

output_rel_col (str): If specified, put the standardized operators in a new column with this name and leave the original operator column unchanged.

invert (bool): If True, replace the inequality operators with their inverses. This is useful when a reported value such as IC50 is converted to its negative log, such as pIC50.

Returns:

DataFrame: DataFrame with the standardized relationship symbols.
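
A minimal sketch for DTC-style data, assuming the same atomsci.ddm.utils package prefix:

    import pandas as pd
    from atomsci.ddm.utils.data_curation_functions import standardize_relations

    dset_df = pd.DataFrame({'standard_relation': ['=', '>=', '<=', '~']})
    std_df = standardize_relations(dset_df, db='DTC')
    # '>=' maps to '>', '<=' maps to '<', and the '~' row is removed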

utils.data_curation_functions.upload_df_dtc_base_smiles_all(dset_name, title, description, tags, functional_area, target, target_type, activity, assay_category, data_df, dtc_mleqonly_fileID, data_origin='journal', species='human', force_update=False)[source]

Uploads DTC base smiles data to the datastore

Uploads base SMILES string for the DTC dataset.

Returns the datastore OID of the uploaded dataset. The dataset is uploaded to the public bucket and lists https://doi.org/10.1016/j.chembiol.2017.11.009 as the DOI. This also assumes that the id_col is ‘compound_id’, the response column is set to PIC50, and the SMILES strings are in ‘base_rdkit_smiles’.

Args:

dset_name (str): Name of the dataset. Should not include a file extension.

title (str): Title of the file (in human-friendly format).

description (str): long text box to describe file (background/use notes)

tags (list): Must be a list of strings.

functional_area (str): The functional area.

target (str): The target.

target_type (str): The target type of the dataset.

activity (str): The activity of the dataset.

assay_category (str): The assay category of the dataset.

data_df (DataFrame): DataFrame to be uploaded.

dtc_mleqonly_fileID (str): Source file id used to generate data_df.

data_origin (str): The origin of the dataset e.g. journal.

species (str): The species of the dataset e.g. human, rat, dog.

force_update (bool): Overwrite existing datasets in the datastore.

Returns:

str: datastore OID of the uploaded dataset.

utils.data_curation_functions.upload_df_dtc_mleqonly(dset_name, title, description, tags, functional_area, target, target_type, activity, assay_category, data_df, dtc_smiles_fileID, data_origin='journal', species='human', force_update=False)[source]

Uploads DTC mleqonly data to the datastore

Upload mleqonly data to the datastore from the given DataFrame. The DataFrame must contain the columns ‘rdkit_smiles’ and ‘VALUE_NUM_mean’. This function is meant to upload data that has been aggregated using atomsci.ddm.utils.curate_data.average_and_remove_duplicates. Returns the datastore OID of the uploaded dataset. The dataset is uploaded to the public bucket and lists https://doi.org/10.1016/j.chembiol.2017.11.009 as the DOI. This also assumes that the id_col is ‘compound_id’.

Args:

dset_name (str): Name of the dataset. Should not include a file extension.

title (str): Title of the file (in human-friendly format).

description (str): long text box to describe file (background/use notes)

tags (list): Must be a list of strings.

functional_area (str): The functional area.

target (str): The target.

target_type (str): The target type of the dataset.

activity (str): The activity of the dataset.

assay_category (str): The assay category of the dataset.

data_df (DataFrame): DataFrame to be uploaded.

dtc_smiles_fileID (str): Source file id used to generate data_df.

data_origin (str): The origin of the dataset e.g. journal.

species (str): The species of the dataset e.g. human, rat, dog.

force_update (bool): Overwrite existing datasets in the datastore.

Returns:

str: datastore OID of the uploaded dataset.

utils.data_curation_functions.upload_df_dtc_mleqonly_class(dset_name, title, description, tags, functional_area, target, target_type, activity, assay_category, data_df, dtc_mleqonly_fileID, data_origin='journal', species='human', force_update=False)[source]

Uploads DTC mleqonly classification data to the datastore

Upload mleqonly classification data to the datastore from the given DataFrame. The DataFrame must contain the columns ‘rdkit_smiles’ and ‘binary_class’. This function is meant to upload data that has been aggregated using atomsci.ddm.utils.curate_data.average_and_remove_duplicates and then thresholded to make a binary classification dataset. Returns the datastore OID of the uploaded dataset. The dataset is uploaded to the public bucket and lists https://doi.org/10.1016/j.chembiol.2017.11.009 as the DOI. This also assumes that the id_col is ‘compound_id’.

Args:

dset_name (str): Name of the dataset. Should not include a file extension.

title (str): Title of the file (in human-friendly format).

description (str): long text box to describe file (background/use notes)

tags (list): Must be a list of strings.

functional_area (str): The functional area.

target (str): The target.

target_type (str): The target type of the dataset.

activity (str): The activity of the dataset.

assay_category (str): The assay category of the dataset.

data_df (DataFrame): DataFrame to be uploaded.

dtc_mleqonly_fileID (str): Source file id used to generate data_df.

data_origin (str): The origin of the dataset e.g. journal.

species (str): The species of the dataset e.g. human, rat, dog.

force_update (bool): Overwrite existing datasets in the datastore.

Returns:

str: datastore OID of the uploaded dataset.

utils.data_curation_functions.upload_df_dtc_smiles(dset_name, title, description, tags, functional_area, target, target_type, activity, assay_category, smiles_df, orig_fileID, data_origin='journal', species='human', force_update=False)[source]

Uploads DTC smiles data to the datastore

Upload a raw dataset to the datastore from the given DataFrame. Returns the datastore OID of the uploaded dataset. The dataset is uploaded to the public bucket and lists https://doi.org/10.1016/j.chembiol.2017.11.009 as the DOI. This also assumes that the id_col is ‘compound_id’.

Args:

dset_name (str): Name of the dataset. Should not include a file extension.

title (str): Title of the file (in human-friendly format).

description (str): long text box to describe file (background/use notes)

tags (list): Must be a list of strings.

functional_area (str): The functional area.

target (str): The target.

target_type (str): The target type of the dataset.

activity (str): The activity of the dataset.

assay_category (str): The assay category of the dataset.

smiles_df (DataFrame): DataFrame containing SMILES to be uploaded.

orig_fileID (str): Source file id used to generate smiles_df.

data_origin (str): The origin of the dataset e.g. journal.

species (str): The species of the dataset e.g. human, rat, dog.

force_update (bool): Overwrite existing datasets in the datastore.

Returns:

str: datastore OID of the uploaded dataset.

utils.data_curation_functions.upload_df_dtc_smiles_regr_all_class(dset_name, title, description, tags, functional_area, target, target_type, activity, assay_category, data_df, dtc_smiles_regr_all_fileID, smiles_column, data_origin='journal', species='human', force_update=False)[source]

Uploads DTC classification data to the datastore

Uploads binary classification data for the DTC dataset. Class names are assumed to be ‘active’ and ‘inactive’.

Returns the datastore OID of the uploaded dataset. The dataset is uploaded to the public bucket and lists https://doi.org/10.1016/j.chembiol.2017.11.009 as the DOI. This also assumes that the id_col is ‘compound_id’ and the response column is set to PIC50.

Args:

dset_name (str): Name of the dataset. Should not include a file extension.

title (str): Title of the file (in human-friendly format).

description (str): long text box to describe file (background/use notes)

tags (list): Must be a list of strings.

functional_area (str): The functional area.

target (str): The target.

target_type (str): The target type of the dataset.

activity (str): The activity of the dataset.

assay_category (str): The assay category of the dataset.

data_df (DataFrame): DataFrame to be uploaded.

dtc_smiles_regr_all_fileID (str): Source file id used to generate data_df.

smiles_column (str): Column containing SMILES.

data_origin (str): The origin of the dataset e.g. journal.

species (str): The species of the dataset e.g. human, rat, dog.

force_update (bool): Overwrite existing datasets in the datastore.

Returns:

str: datastore OID of the uploaded dataset.

utils.data_curation_functions.upload_df_excape_mleqonly(dset_name, title, description, tags, functional_area, target, target_type, activity, assay_category, data_df, smiles_fileID, data_origin='journal', species='human', force_update=False)[source]

Uploads Excape mleqonly data to the datastore

Upload mleqonly to the datastore from the given DataFrame. Returns the datastore OID of the uploaded dataset. The dataset is uploaded to the public bucket and lists https://dx.doi.org/10.1186%2Fs13321-017-0203-5 as the doi. This also assumes that the id_col is ‘Original_Entry_ID’, smiles_col is ‘rdkit_smiles’ and response_col is ‘VALUE_NUM_mean’.

Args:

dset_name (str): Name of the dataset. Should not include a file extension.

title (str): Title of the file (in human-friendly format).

description (str): long text box to describe file (background/use notes)

tags (list): Must be a list of strings.

functional_area (str): The functional area.

target (str): The target.

target_type (str): The target type of the dataset.

activity (str): The activity of the dataset.

assay_category (str): The assay category of the dataset.

data_df (DataFrame): DataFrame containing SMILES to be uploaded.

smiles_fileID (str): Source file id used to generate data_df.

data_origin (str): The origin of the dataset e.g. journal.

species (str): The species of the dataset e.g. human, rat, dog.

force_update (bool): Overwrite existing datasets in the datastore.

Returns:

str: datastore OID of the uploaded dataset.

utils.data_curation_functions.upload_df_excape_mleqonly_class(dset_name, title, description, tags, functional_area, target, target_type, activity, assay_category, data_df, mleqonly_fileID, data_origin='journal', species='human', force_update=False)[source]

Uploads Excape mleqonly classification data to the datastore

data_df contains a binary classification dataset with ‘active’ and ‘inactive’ classes.

Upload mleqonly classification to the datastore from the given DataFrame. Returns the datastore OID of the uploaded dataset. The dataset is uploaded to the public bucket and lists https://dx.doi.org/10.1186%2Fs13321-017-0203-5 as the doi. This also assumes that the id_col is ‘Original_Entry_ID’, smiles_col is ‘rdkit_smiles’ and response_col is ‘binary_class’.

Args:

dset_name (str): Name of the dataset. Should not include a file extension.

title (str): Title of the file (in human-friendly format).

description (str): long text box to describe file (background/use notes)

tags (list): Must be a list of strings.

functional_area (str): The functional area.

target (str): The target.

target_type (str): The target type of the dataset.

activity (str): The activity of the dataset.

assay_category (str): The assay category of the dataset.

data_df (DataFrame): DataFrame containing SMILES to be uploaded.

mleqonly_fileID (str): Source file id used to generate data_df.

data_origin (str): The origin of the dataset e.g. journal.

species (str): The species of the dataset e.g. human, rat, dog.

force_update (bool): Overwrite existing datasets in the datastore.

Returns:

str: datastore OID of the uploaded dataset.

utils.data_curation_functions.upload_df_excape_smiles(dset_name, title, description, tags, functional_area, target, target_type, activity, assay_category, smiles_df, orig_fileID, data_origin='journal', species='human', force_update=False)[source]

Uploads Excape SMILES data to the datastore

Upload SMILES to the datastore from the given DataFrame. Returns the datastore OID of the uploaded dataset. The dataset is uploaded to the public bucket and lists https://dx.doi.org/10.1186%2Fs13321-017-0203-5 as the doi. This also assumes that the id_col is ‘Original_Entry_ID’

Args:

dset_name (str): Name of the dataset. Should not include a file extension.

title (str): Title of the file (in human-friendly format).

description (str): long text box to describe file (background/use notes)

tags (list): Must be a list of strings.

functional_area (str): The functional area.

target (str): The target.

target_type (str): The target type of the dataset.

activity (str): The activity of the dataset.

assay_category (str): The assay category of the dataset.

smiles_df (DataFrame): DataFrame containing SMILES to be uploaded.

orig_fileID (str): Source file id used to generate smiles_df.

data_origin (str): The origin of the dataset e.g. journal.

species (str): The species of the dataset e.g. human, rat, dog.

force_update (bool): Overwrite existing datasets in the datastore.

Returns:

str: datastore OID of the uploaded dataset.

utils.data_curation_functions.upload_file_dtc_raw_data(dset_name, title, description, tags, functional_area, target, target_type, activity, assay_category, file_path, data_origin='journal', species='human', force_update=False)[source]

Uploads raw DTC data to the datastore

Upload a raw dataset to the datastore from the given file path. Returns the datastore OID of the uploaded dataset. The dataset is uploaded to the public bucket and lists https://doi.org/10.1016/j.chembiol.2017.11.009 as the DOI. This also assumes that the id_col is ‘compound_id’.

Args:

dset_name (str): Name of the dataset. Should not include a file extension.

title (str): Title of the file (in human-friendly format).

description (str): long text box to describe file (background/use notes)

tags (list): Must be a list of strings.

functional_area (str): The functional area.

target (str): The target.

target_type (str): The target type of the dataset.

activity (str): The activity of the dataset.

assay_category (str): The assay category of the dataset.

file_path (str): The filepath of the dataset.

data_origin (str): The origin of the dataset e.g. journal.

species (str): The species of the dataset e.g. human, rat, dog.

force_update (bool): Overwrite existing datasets in the datastore.

Returns:

str: datastore OID of the uploaded dataset.

utils.data_curation_functions.upload_file_dtc_smiles_regr_all(dset_name, title, description, tags, functional_area, target, target_type, activity, assay_category, file_path, dtc_smiles_fileID, smiles_column, data_origin='journal', species='human', force_update=False)[source]

Uploads regression DTC data to the datastore

Uploads regression dataset for DTC dataset.

Returns the datastore OID of the uploaded dataset. The dataset is uploaded to the public bucket and lists https://doi.org/10.1016/j.chembiol.2017.11.009 as the DOI. This also assumes that the id_col is ‘compound_id’ and the response column is set to PIC50.

Args:

dset_name (str): Name of the dataset. Should not include a file extension.

title (str): Title of the file (in human-friendly format).

description (str): long text box to describe file (background/use notes)

tags (list): Must be a list of strings.

functional_area (str): The functional area.

target (str): The target.

target_type (str): The target type of the dataset.

activity (str): The activity of the dataset.

assay_category (str): The assay category of the dataset.

file_path (str): The filepath of the dataset to be uploaded.

dtc_smiles_fileID (str): Source file id used to generate the dataset.

smiles_column (str): Column containing SMILES.

data_origin (str): The origin of the dataset e.g. journal.

species (str): The species of the dataset e.g. human, rat, dog.

force_update (bool): Overwrite existing datasets in the datastore.

Returns:

str: datastore OID of the uploaded dataset.

utils.data_curation_functions.upload_file_excape_raw_data(dset_name, title, description, tags, functional_area, target, target_type, activity, assay_category, file_path, data_origin='journal', species='human', force_update=False)[source]

Uploads raw Excape data to the datastore

Upload a raw dataset to the datastore from the given file path. Returns the datastore OID of the uploaded dataset. The dataset is uploaded to the public bucket and lists https://dx.doi.org/10.1186%2Fs13321-017-0203-5 as the DOI. This also assumes that the id_col is ‘Original_Entry_ID’.

Args:

dset_name (str): Name of the dataset. Should not include a file extension.

title (str): Title of the file (in human-friendly format).

description (str): long text box to describe file (background/use notes)

tags (list): Must be a list of strings.

functional_area (str): The functional area.

target (str): The target.

target_type (str): The target type of the dataset.

activity (str): The activity of the dataset.

assay_category (str): The assay category of the dataset.

file_path (str): The filepath of the dataset.

data_origin (str): The origin of the dataset e.g. journal.

species (str): The species of the dataset e.g. human, rat, dog.

force_update (bool): Overwrite existing datasets in the datastore.

Returns:

str: datastore OID of the uploaded dataset.

utils.datastore_functions module

utils.hyperparam_search_wrapper module

utils.many_to_one module

exception utils.many_to_one.ManyToOneException[source]

Bases: Exception

exception utils.many_to_one.NANCompoundIDException[source]

Bases: Exception

exception utils.many_to_one.NANSMILESException[source]

Bases: Exception

utils.many_to_one.has_nans(df, col)[source]
utils.many_to_one.many_to_one(fn, smiles_col, id_col)[source]
utils.many_to_one.many_to_one_df(df, smiles_col, id_col)[source]

AMPL requires that SMILES strings and compound_ids have a many-to-one mapping. This function checks the dataset against this constraint. It also checks whether any SMILES strings or compound_ids are empty/NaN.

Arguments:

df (pd.DataFrame): The DataFrame in question.

smiles_col (str): The column containing SMILES strings.

id_col (str): The column containing compound ids.

Returns:
True if there is a many-to-one mapping. Raises one of three exceptions if the DataFrame:
  • Has NaN compound_ids

  • Has NaN SMILES

  • Is not a many-to-one mapping between compound_ids and SMILES
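
A minimal sketch, assuming the module lives under the same atomsci.ddm.utils package prefix used elsewhere in this documentation:

    import pandas as pd
    from atomsci.ddm.utils.many_to_one import many_to_one_df

    df = pd.DataFrame({'compound_id': ['C1', 'C2'],
                       'rdkit_smiles': ['CCO', 'c1ccccc1']})
    # Returns True if the ID/SMILES mapping satisfies the many-to-one requirement;
    # raises one of the exceptions above for NaN IDs, NaN SMILES, or a mapping violation.
    many_to_one_df(df, smiles_col='rdkit_smiles', id_col='compound_id')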

utils.many_to_one.no_nan_ids_or_smiles(df, smiles_col, id_col)[source]

utils.model_file_reader module

class utils.model_file_reader.ModelFileReader(data_file_path)[source]

Bases: object

A class to encapsulate a model’s metadata that you might want to read from a folder: for example, the version number, dataset key, or split UUID of a model.

Attributes:
Set in __init__:

data_file_path (str): a model data file or a directory that contains the model
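
A minimal sketch, assuming the atomsci.ddm.utils package prefix; the model file path is hypothetical:

    from atomsci.ddm.utils.model_file_reader import ModelFileReader

    reader = ModelFileReader('my_model.tar.gz')   # a saved AMPL model file or its directory
    info = reader.get_model_info()                # dict of key model parameters and metrics
    print(reader.get_version(), reader.get_dataset_key(), reader.get_split_uuid())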

get_dataset_key()[source]

Returns: (str): model dataset key

get_descriptor_type()[source]

Returns: (str): model descriptor type

get_featurizer()[source]

Returns: (str): model featurizer

get_id_col()[source]

Returns: (str): model id column

get_model_info()[source]

Extract the model metadata (and if applicable, model metrics)

Returns:

a dictionary of the most important model parameters and metrics.

get_model_parameters()[source]

Returns: (str): model parameters

get_model_type()[source]

Returns: (str): model type

get_model_uuid()[source]

Returns: (str): model uuid

get_response_cols()[source]

Returns: (str): model response columns

get_smiles_col()[source]

Returns: (str): model SMILES column

get_split_csv()[source]

Returns: (str): model split csv

get_split_strategy()[source]

Returns: (str): model split strategy

get_split_uuid()[source]

Returns: (str): model split_uuid

get_splitter()[source]

Returns: (str): model splitter

get_splitting_parameters()[source]

Returns: (str): model splitting parameters

get_training_dataset()[source]

Returns: (str): model training dataset

get_version()[source]

Returns: (str): model version

utils.model_file_reader.get_multiple_models_metadata(*args)[source]

A function that takes model tar.gz file(s) and extracts the metadata (and, if applicable, model metrics).

Args:

*args: Variable length argument list of model tar.gz file(s)

Returns:

a list of each model’s most important parameters and metrics, or an empty list if it fails to parse the input file(s).

Exception:

IOError: Problem accessing the file, or failure to parse the input file as an AMPL model.

utils.model_file_reader.main(argv)[source]

utils.model_retrain module

utils.model_retrain.main(argv)[source]
utils.model_retrain.train_model(input, output, dskey='', production=False)[source]

Retrain a model saved in a model_metadata.json file

Args:

input (str): path to model_metadata.json file

output (str): path to output directory

dskey (str): new dataset key if file location has changed

production (bool): retrain the model using production mode

Returns:

None
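
A minimal sketch, assuming the atomsci.ddm.utils package prefix; the paths are hypothetical:

    from atomsci.ddm.utils.model_retrain import train_model

    # dskey only needs to be supplied if the dataset file has moved since the model was trained
    train_model('model_metadata.json', './retrained_models', dskey='', production=False)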

utils.model_retrain.train_model_from_tar(input, output, dskey='', production=False)[source]

Retrain a model saved in a tar.gz file

Args:

input (str): path to a tar.gz file

output (str): path to output directory

dskey (str): new dataset key if file location has changed

Returns:

None

utils.model_retrain.train_model_from_tracker(model_uuid, output_dir, production=False)[source]

Retrain a model saved in the model tracker, but save it to output_dir and don’t insert it into the model tracker

Args:

model_uuid (str): model_uuid of a model saved in the model tracker

output_dir (str): path to output directory

Returns:

the model pipeline object with trained model

utils.model_retrain.train_models_from_dataset_keys(input, output, pred_type='regression', production=False)[source]

Retrain a list of models from an input file

Args:

input (str): Path to an Excel or csv file. The required columns are ‘dataset_key’ and ‘bucket’ (public, private_file or Filesystem).

output (str): path to output directory

pred_type (str, optional): Sets the model prediction type. If not specified, the default ‘regression’ is used.

Returns:

None

utils.model_version_utils module

model_version_utils.py

Misc utilities to get the AMPL version(s) used to train one or more models and check them for compatibility with the currently running version of AMPL.

To check the model version

usage: model_version_utils.py [-h] -i INPUT

optional arguments:
-h, --help

show this help message and exit

-i INPUT, --input INPUT

input directory/file (required)

utils.model_version_utils.check_version_compatible(input, ignore_check=False)[source]

Compare the input file’s version against the running AMPL version to see if they are compatible

Args:

input (str): model file or version number

Returns:

True if the input model version matches the compatible AMPL version group

utils.model_version_utils.get_ampl_version()[source]

Get the running ampl version

Returns:

the AMPL version
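
A minimal sketch, assuming the atomsci.ddm.utils package prefix; the model path is hypothetical:

    from atomsci.ddm.utils.model_version_utils import get_ampl_version, check_version_compatible

    print(get_ampl_version())                    # AMPL version currently running
    check_version_compatible('my_model.tar.gz')  # True if the model's AMPL version is compatible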

utils.model_version_utils.get_ampl_version_from_dir(dirname)[source]

Get the AMPL versions for all the models stored under the given directory and its subdirectories, recursively.

Args:

dirname (str): directory

Returns:

list of AMPL versions

utils.model_version_utils.get_ampl_version_from_json(metadata_path)[source]

Parse model_metadata.json to get the AMPL version

Args:

metadata_path (str): path to a model_metadata.json file

Returns:

the AMPL version number

utils.model_version_utils.get_ampl_version_from_model(filename)[source]

Get the AMPL version from the tar file’s model_metadata.json

Args:

filename (str): tar file

Returns:

the AMPL version number

utils.model_version_utils.get_major_version(full_version)[source]
utils.model_version_utils.main(argv)[source]
utils.model_version_utils.validate_version(input)[source]

utils.pubchem_utils module

utils.pubchem_utils.download_SID_from_bioactivity_assay(bioassayid)[source]

Retrieve summary info on bioactivity assays.

Args:

bioassayid: A single PubChem AID (bioactivity assay id).

Returns:

The SIDs tested in this assay.

utils.pubchem_utils.download_activitytype(aid, sid)[source]

Retrieve data for assays for a select list of sids.

Args:

aid: A bioactivity assay id (AID).

sid: A substance id (SID).

Returns:

Nothing returned yet, will return basic stats to help decide whether to use assay or not

utils.pubchem_utils.download_bioactivity_assay(myList, intv=1)[source]

Retrieve summary info on bioactivity assays.

Args:

myList (list): List of PubChem AIDs (bioactivity assay ids)

intv (int): Number of AIDs to submit queries for in one request; default is 1.

Returns:

Nothing returned yet, will return basic stats to help decide whether to use assay or not

utils.pubchem_utils.download_dose_response_from_bioactivity(aid, sidlst)[source]

Retrieve data for assays for a select list of sids.

Args:

aid: A bioactivity assay id (AID).

sidlst (list): List of SIDs specified as integers.

Returns:

Nothing returned yet, will return basic stats to help decide whether to use assay or not

utils.pubchem_utils.download_smiles(myList, intv=1)[source]

Retrieve canonical SMILES strings for a list of input INCHIKEYS. Will return only one SMILES string per INCHIKEY. If multiple values are returned, the first is retained and the others are returned in the discard_lst. INCHIKEYS that fail to return a SMILES string are put in the fail_lst.

Args:

myList (list): List of INCHIKEYS

intv (int): Number of INCHIKEYS to submit queries for in one request; default is 1.

Returns:

list of SMILES strings corresponding to INCHIKEYS

list of INCHIKEYS, which failed to return a SMILES string

list of CIDs and SMILES strings which were returned beyond the first CID and SMILES found for an input INCHIKEY
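
A minimal sketch, assuming the atomsci.ddm.utils package prefix and that the three lists are returned in the order described above; the INCHIKEY is illustrative and the call requires network access to PubChem:

    from atomsci.ddm.utils.pubchem_utils import download_smiles

    inchikeys = ['BSYNRYMUTXBXSQ-UHFFFAOYSA-N']   # aspirin, as an illustrative INCHIKEY
    smiles_lst, fail_lst, discard_lst = download_smiles(inchikeys, intv=1)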

utils.rdkit_easy module

utils.split_response_dist_plots module

Module to plot distributions of response values in each subset of a dataset generated by a split

utils.split_response_dist_plots.get_split_labeled_dataset(params)[source]

Add a column to a dataset labeling the split subset for each row. Given a dataset and split parameters (including split_uuid) referenced in params, returns a data frame containing the dataset with an extra ‘split_subset’ column indicating the subset each data point belongs to. For standard 3-way splits, the labels will be ‘train’, ‘valid’ and ‘test’. For a k-fold CV split, the labels will be ‘fold_0’ through ‘fold_<k-1>’ and ‘test’.

Args:

params (argparse.Namespace or dict): Structure containing dataset and split parameters. The following parameters are required, if not set to default values:
- dataset_key
- split_uuid
- split_strategy
- splitter
- split_valid_frac
- split_test_frac
- num_folds
- smiles_col
- response_cols

Returns:

A tuple (dset_df, split_label):
- dset_df (DataFrame): The dataset specified by params.dataset_key, with additional column split_subset.
- split_label (str): A short description of the split, useful for plot labeling.

utils.split_response_dist_plots.plot_split_subset_response_distrs(params)[source]

Plot the distributions of the response variable(s) in each split subset of a dataset.

Args:

params (argparse.Namespace or dict): Structure containing dataset and split parameters. The following parameters are required, if not set to default values:
- dataset_key
- split_uuid
- split_strategy
- splitter
- split_valid_frac
- split_test_frac
- num_folds
- smiles_col
- response_cols

Returns:

None
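
A minimal sketch, assuming the atomsci.ddm.utils package prefix; all parameter values below are hypothetical:

    from atomsci.ddm.utils.split_response_dist_plots import (
        get_split_labeled_dataset, plot_split_subset_response_distrs)

    params = dict(dataset_key='my_dataset.csv', split_uuid='some-split-uuid',
                  split_strategy='train_valid_test', splitter='scaffold',
                  split_valid_frac=0.15, split_test_frac=0.15, num_folds=1,
                  smiles_col='rdkit_smiles', response_cols=['pIC50'])
    dset_df, split_label = get_split_labeled_dataset(params)   # adds a 'split_subset' column
    plot_split_subset_response_distrs(params)                  # response distributions per subset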

utils.struct_utils module

Functions to manipulate and convert between various representations of chemical structures: SMILES, InChi and RDKit Mol objects. Many of these functions (those with a ‘workers’ argument) accept either a single SMILES or InChi string or a list of strings as their first argument, and return a value with the same datatype. If a list is passed and the ‘workers’ argument is > 1, the calculation is parallelized across multiple threads; this can save significant time when operating on thousands of molecules.

utils.struct_utils.base_mol_from_inchi(inchi_str, useIsomericSmiles=True, removeCharges=False)[source]

Generate a standardized RDKit Mol object for the largest fragment of the molecule specified by InChi string inchi_str. Replace any rare isotopes with the most common ones for each element. If removeCharges is True, add hydrogens as needed to eliminate charges.

Args:

inchi_str (str): InChi string representing molecule.

useIsomericSmiles (bool): Whether to retain stereochemistry information in the generated string.

removeCharges (bool): If true, add or remove hydrogens to produce uncharged molecules.

Returns:

rdkit.Chem.Mol: Standardized Mol object for the salt-stripped parent molecule.

utils.struct_utils.base_mol_from_smiles(orig_smiles, useIsomericSmiles=True, removeCharges=False)[source]

Generate a standardized RDKit Mol object for the largest fragment of the molecule specified by orig_smiles. Replace any rare isotopes with the most common ones for each element. If removeCharges is True, add hydrogens as needed to eliminate charges.

Args:

orig_smiles (str): SMILES string to standardize.

useIsomericSmiles (bool): Whether to retain stereochemistry information in the generated string.

removeCharges (bool): If true, add or remove hydrogens to produce uncharged molecules.

Returns:

rdkit.Chem.Mol: Standardized Mol object for the salt-stripped parent molecule.

utils.struct_utils.base_smiles_from_inchi(inchi_str, useIsomericSmiles=True, removeCharges=False, workers=1)[source]

Generate standardized salt-stripped SMILES strings for the largest fragments of each molecule represented by InChi string(s) inchi_str. Replaces any rare isotopes with the most common ones for each element.

Args:

inchi_str (list or str): List of InChi strings to convert.

useIsomericSmiles (bool): Whether to retain stereochemistry information in the generated strings.

removeCharges (bool): If true, add or remove hydrogens to produce uncharged molecules.

workers (int): Number of parallel threads to use for calculation.

Returns:

list or str: Standardized SMILES strings.

utils.struct_utils.base_smiles_from_smiles(orig_smiles, useIsomericSmiles=True, removeCharges=False, useCanonicalTautomers=False, workers=1)[source]

Generate standardized SMILES strings for the largest fragments of each molecule specified by orig_smiles. Strips salt groups and replaces any rare isotopes with the most common ones for each element.

Args:

orig_smiles (list or str): List of SMILES strings to canonicalize.

useIsomericSmiles (bool): Whether to retain stereochemistry information in the generated strings.

removeCharges (bool): If true, add or remove hydrogens to produce uncharged molecules.

useCanonicalTautomers (bool): Whether to convert the generated SMILES to their canonical tautomers. Defaults to False for backward compatibility.

workers (int): Number of parallel threads to use for calculation.

Returns:

list or str: Canonicalized SMILES strings.
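
A minimal sketch (the atomsci.ddm.utils.struct_utils path is referenced elsewhere in this documentation):

    from atomsci.ddm.utils.struct_utils import base_smiles_from_smiles

    # A single SMILES string returns a string; a list returns a list, and workers > 1
    # parallelizes the calculation across threads.
    base = base_smiles_from_smiles('CC(=O)Oc1ccccc1C(=O)[O-].[Na+]')   # salt-stripped aspirin
    bases = base_smiles_from_smiles(['CCO', 'c1ccccc1CN'], workers=4)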

utils.struct_utils.canonical_tautomers_from_smiles(smiles)[source]

Returns SMILES strings for the canonical tautomers of a SMILES string or list of SMILES strings

Args:

smiles (list or str): List of SMILES strings.

Returns:

(list of str) : List of SMILES strings for the canonical tautomers.

utils.struct_utils.draw_structure(smiles_str, image_path, image_size=500)[source]

Draw structure for the compound with the given SMILES string as a PNG file.

Note that there are more flexible functions for drawing structures in the rdkit_easy module. This function is only retained for backward compatibility.

Args:

smiles_str (str): SMILES representation of compound.

image_path (str): Filepath for image file to be generated.

image_size (int): Width of square bounding box for image.

Returns:

None.

utils.struct_utils.fix_moe_smiles(smiles)[source]

Correct the SMILES strings generated by MOE to standardize the representation of protonated atoms, so that RDKit can read them.

Args:

smiles (str): SMILES string.

Returns:

str: The corrected SMILES string.

utils.struct_utils.get_rdkit_smiles(orig_smiles, useIsomericSmiles=True)[source]

Given a SMILES string, regenerate a “canonical” SMILES string for the same molecule using the implementation in RDKit.

Args:

orig_smiles (str): SMILES string to canonicalize.

useIsomericSmiles (bool): Whether to retain stereochemistry information in the generated string.

Returns:

str: Canonicalized SMILES string.

utils.struct_utils.kekulize_smiles(orig_smiles, useIsomericSmiles=True, workers=1)[source]

Generate Kekulized SMILES strings for the molecules specified by orig_smiles. Kekulized SMILES strings are ones in which aromatic rings are represented by uppercase letters with alternating single and double bonds, rather than lowercase letters; they are needed by some external applications.

Args:

orig_smiles (list or str): List of SMILES strings to Kekulize.

useIsomericSmiles (bool): Whether to retain stereochemistry information in the generated strings.

workers (int): Number of parallel threads to use for calculation.

Returns:

list or str: Kekulized SMILES strings.

utils.struct_utils.mol_wt_from_smiles(smiles, workers=1)[source]

Calculate molecular weights for molecules represented by SMILES strings.

Args:

smiles (list or str): List of SMILES strings.

workers (int): Number of parallel threads to use for calculations.

Returns:

list or float: Molecular weights. NaN is returned for SMILES strings that could not be read by RDKit.

utils.struct_utils.mols_from_smiles(orig_smiles, workers=1)[source]

Parallel function to create RDKit Mol objects for a list of SMILES strings. If orig_smiles is a list and workers is > 1, spawn ‘workers’ threads to convert input SMILES strings to Mol objects.

Args:

orig_smiles (list or str): List of SMILES strings to convert to Mol objects.

workers (int): Number of parallel threads to use for calculation.

Returns:

list of rdkit.Chem.Mol: RDKit objects representing molecules.

utils.struct_utils.rdkit_smiles_from_smiles(orig_smiles, useIsomericSmiles=True, useCanonicalTautomers=False, workers=1)[source]

Parallel version of get_rdkit_smiles. If orig_smiles is a list and workers is > 1, spawn ‘workers’ threads to convert input SMILES strings to standardized RDKit format.

Args:

orig_smiles (list or str): List of SMILES strings to canonicalize.

useIsomericSmiles (bool): Whether to retain stereochemistry information in the generated strings.

useCanonicalTautomers (bool): Whether to convert the generated SMILES to their canonical tautomers. Defaults to False for backward compatibility.

workers (int): Number of parallel threads to use for calculation.

Returns:

list or str: Canonicalized SMILES strings.

utils.struct_utils.smiles_to_inchi_key(smiles)[source]

Generates an InChI key from a SMILES string. Note that an InChI key is different from an InChI string; it can be used as a unique identifier, but doesn’t hold the information needed to reconstruct a molecule.

Args:

smiles (str): SMILES string.

Returns:

str: An InChI key. Returns None if RDKit cannot convert the SMILES string to an RDKit Mol object.

Module contents