utils package¶
Submodules¶
utils.compare_splits_plots module¶
- class utils.compare_splits_plots.SplitStats(total_df, split_df, smiles_col, id_col, response_cols)[source]¶
Bases: object
This object manages a dataset and a given split dataframe.
- dist_hist_plot(dists, title, dist_path='')[source]¶
Creates a histogram of pairwise Tanimoto distances between training and test sets
- Args:
- dist_path (str): Optional Where to save the plot. The string ‘_dist_hist’ will be
appended to this input
- dist_hist_train_v_test_plot(ax=None)[source]¶
Plots Tanimoto differences between training and test subsets
- Returns:
g (Seaborn FacetGrid): FacetGrid object from seaborn
- dist_hist_train_v_valid_plot(ax=None)[source]¶
Plots Tanimoto differences between training and valid subsets
- Returns:
g (Seaborn FacetGrid): FacetGrid object from seaborn
- make_all_plots(dist_path='')[source]¶
Makes a series of diagnostic plots
- Args:
- dist_path (str): Optional Where to save the plot. The string ‘_frac_box’ will be
appended to this input
- utils.compare_splits_plots.save_figure(filename)[source]¶
Saves a figure to disk. Saves both png and svg formats.
- Args:
filename (str): The name of the figure.
- utils.compare_splits_plots.split(total_df, split_df, id_col)[source]¶
Splits a dataset into training, test and validation sets using a given split.
- Args:
total_df (DataFrame): A pandas dataframe.
split_df (DataFrame): A split dataframe containing ‘cmpd_id’ and ‘subset’ columns.
id_col (str): The ID column in total_df
- Returns:
- (DataFrame, DataFrame, DataFrame): Three dataframes for train, test, and valid
respectively.
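As a rough illustration of the lookup that split() describes, the sketch below partitions rows by the subset assigned to each compound ID. It is a dependency-free stand-in that uses lists of dicts instead of DataFrames. The function name split_by_subset and the subset labels ‘train’, ‘test’ and ‘valid’ are assumptions made for this example, not taken from the module.

```python
def split_by_subset(total_rows, split_rows, id_col):
    """Partition total_rows into (train, test, valid) lists.

    split_rows holds dicts with 'cmpd_id' and 'subset' keys, mirroring the
    split dataframe described above. The subset labels used here are assumed.
    """
    subset_of = {row["cmpd_id"]: row["subset"] for row in split_rows}
    parts = {"train": [], "test": [], "valid": []}
    for row in total_rows:
        subset = subset_of.get(row[id_col])
        if subset in parts:  # rows with no split assignment are skipped
            parts[subset].append(row)
    return parts["train"], parts["test"], parts["valid"]


total = [
    {"compound_id": "c1", "smiles": "CCO"},
    {"compound_id": "c2", "smiles": "CCN"},
    {"compound_id": "c3", "smiles": "CCC"},
    {"compound_id": "c4", "smiles": "CCCl"},
]
assignments = [
    {"cmpd_id": "c1", "subset": "train"},
    {"cmpd_id": "c2", "subset": "test"},
    {"cmpd_id": "c3", "subset": "valid"},
    {"cmpd_id": "c4", "subset": "train"},
]
train, test, valid = split_by_subset(total, assignments, "compound_id")
```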
utils.curate_data module¶
Utility functions used for AMPL dataset curation and creation.
- utils.curate_data.add_classification_column(thresholds, value_column, label_column, data, right_inclusive=True)[source]¶
Add a classification column to a DataFrame.
Add a classification column ‘label_column’ to DataFrame ‘data’ based on values in ‘value_column’, according to a sequence of thresholds. The number of classes is one plus the number of thresholds.
- Args:
- thresholds (float or sequence of floats): Thresholds to use to assign class labels. Label i will
be assigned to values such that thresholds[i-1] < value <= thresholds[i] (if right_inclusive is True) or thresholds[i-1] <= value < thresholds[i] (otherwise).
value_column (str): Name of the column from which class labels are derived.
label_column (str): Name of the new column to be created for class labels.
data (DataFrame): DataFrame holding all data.
- right_inclusive (bool): Whether the thresholding intervals are closed on the right or on the left.
Set this False to get the same behavior as add_binary_tertiary_classification. The default behavior is preferred for the common case where the classification is based on a left-censoring threshold.
- Returns:
DataFrame: DataFrame updated to include class label column.
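The interval rule for thresholds can be checked with a small standalone sketch. It is not the library's implementation: the helper name classify is hypothetical, and it labels one value at a time using the standard-library bisect module rather than operating on a DataFrame column.

```python
from bisect import bisect_left, bisect_right


def classify(value, thresholds, right_inclusive=True):
    """Return the class index of value for the given thresholds."""
    if isinstance(thresholds, (int, float)):
        thresholds = [thresholds]  # a single float is one threshold
    thresholds = sorted(thresholds)
    if right_inclusive:
        # counts thresholds strictly below value: t[i-1] < value <= t[i]
        return bisect_left(thresholds, value)
    # counts thresholds at or below value: t[i-1] <= value < t[i]
    return bisect_right(thresholds, value)


# Two thresholds give three classes (0, 1, 2).
labels = [classify(v, [5.0, 7.0]) for v in (4.2, 5.0, 6.5, 7.0, 8.1)]
```

A value exactly equal to a threshold falls in the lower class when right_inclusive is True and in the upper class otherwise, which is the only case where the two settings differ.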
- utils.curate_data.aggregate_assay_data(assay_df, value_col='VALUE_NUM', output_value_col=None, label_actives=True, active_thresh=None, id_col='CMPD_NUMBER', smiles_col='rdkit_smiles', relation_col='VALUE_FLAG', date_col=None, verbose=False)[source]¶
Aggregates replicated values in assay data
Map RDKit SMILES strings in assay_df to base structures, then compute an MLE estimate of the mean value over replicate measurements for the same SMILES strings, taking censoring into account. Generate an aggregated result table with one value for each unique base SMILES string, to be used in an ML-ready dataset.
- Args:
assay_df (DataFrame): The input DataFrame to be processed.
value_col (str): The column in the DataFrame containing assay values to be averaged.
output_value_col (str): Optional; the column name to use in the output DataFrame for the averaged data.
label_actives (bool): If True, generate an additional column ‘active’ indicating whether the mean value is above a threshold specified by active_thresh.
- active_thresh (float): The threshold to be used for labeling compounds as active or inactive.
If active_thresh is None (the default), the threshold used is the minimum reported value across all records with left-censored values (i.e., those with ‘<’ in the relation column).
id_col (str): The input DataFrame column containing compound IDs.
smiles_col (str): The input DataFrame column containing SMILES strings.
relation_col (str): The input DataFrame column containing relational operators (<, >, etc.).
- date_col (str): The input DataFrame column containing dates when the assay data was uploaded. If not None, the code will assign the earliest
date among replicates to the aggregate data record.
- Returns:
A DataFrame containing averaged assay values, with one value per compound.
- utils.curate_data.average_and_remove_duplicates(column, tolerance, list_bad_duplicates, data, max_stdev=100000, compound_id='CMPD_NUMBER', rm_duplicate_only=False, smiles_col='rdkit_smiles_parent')[source]¶
Iteratively removes ‘bad duplicates’ until none are left.
This function removes duplicates based on max_stdev and tolerance. If the value in data[column] falls too far from the mean based on tolerance and max_stdev then that entry is removed. This is repeated until all bad entries are removed
- Args:
column (str): column with the value of interest
- tolerance (float): acceptable % difference between value and average
i.e., if “[(value - mean)/mean*100]>tolerance” then remove the data row
list_bad_duplicates (str): ‘Yes’ to list the bad duplicates
data (DataFrame): input DataFrame
max_stdev (float): maximum standard deviation threshold
compound_id (str): column containing compound ids
- rm_duplicate_only (bool): only remove bad duplicates, don’t average good ones, the resulting table can be fed into aggregate assay data to further process.
note: The mean is recalculated on each loop through to make sure it isn’t skewed by the ‘bad duplicate’ values
smiles_col (str): column containing base rdkit smiles strings
- Returns:
DataFrame: Returns remaining rows after all bad duplicates have been removed.
- utils.curate_data.create_new_rows_for_extra_results(extra_result_col, value_col, data)[source]¶
Moves results from an extra column to an existing column
Returns a new DataFrame with values from ‘extra_result_col’ appended to the end of ‘value_col’. NaN values in ‘extra_result_col’ are dropped. ‘Extra_result_col’ is dropped from the resulting DataFrame
- Args:
extra_result_col (str): A column in ‘data’.
value_col (str): A column in ‘data’.
data (DataFrame):
- Returns:
DataFrame
- utils.curate_data.filter_in_by_column_values(column, values, data)[source]¶
Include rows only for given values in specified column.
Filters in all rows in data if row[column] in values.
- Args:
column (str): Name of a column in data.
- values (iterable): An iterable, Series, DataFrame, or dict of values
contained in data[column].
data (DataFrame): A DataFrame.
- Returns:
DataFrame: DataFrame containing filtered rows.
- utils.curate_data.filter_in_out_by_column_values(column, values, data, in_out)[source]¶
Include rows only for given values in specified column.
Given a DataFrame, column, and an iterable, Series, DataFrame, or dict of values, return a DataFrame with the rows containing a value in values, or all rows that do not contain a value in values.
- Args:
column (str): Name of a column in data.
- values (iterable): An iterable, Series, DataFrame, or dict of values
contained in data[column].
data (DataFrame): A DataFrame.
- in_out (str): If set to ‘in’, will filter in rows that contain a value
in values. If set to anything else, this function will filter out rows that contain a value in values.
- Returns:
DataFrame: DataFrame containing filtered rows.
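The in/out switch can be sketched without pandas as follows. This is an illustrative stand-in on a list of dicts, and the name filter_rows is hypothetical. It only shows the documented rule that ‘in’ keeps matching rows and any other setting drops them.

```python
def filter_rows(rows, column, values, in_out):
    """Keep rows whose row[column] is in values ('in') or not in values (otherwise)."""
    wanted = set(values)
    keep_matches = in_out == "in"
    return [row for row in rows if (row[column] in wanted) == keep_matches]


rows = [
    {"assay": "A1", "value": 5.1},
    {"assay": "A2", "value": 6.3},
    {"assay": "A3", "value": 7.0},
]
kept = filter_rows(rows, "assay", ["A1", "A3"], "in")
dropped = filter_rows(rows, "assay", ["A1", "A3"], "out")
```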
- utils.curate_data.filter_out_by_column_values(column, values, data)[source]¶
Exclude rows only for given values in specified column.
Filters out all rows in data if row[column] in values.
- Args:
column (str): Name of a column in data.
- values (iterable): An iterable, Series, DataFrame, or dict of values
contained in data[column].
data (DataFrame): A DataFrame.
- Returns:
DataFrame: DataFrame containing filtered rows.
- utils.curate_data.filter_out_comments(values, values_cs, data)[source]¶
Remove rows that contain the text listed
Removes any rows where data[‘COMMENTS’] contains the words in values or values_cs. Used for removing results that indicate bad data in the comments.
- Args:
values (str): list of values that are not case sensitive
values_cs (str): list of values that are case sensitive
data (DataFrame): DataFrame containing a column named ‘COMMENTS’
- Returns:
DataFrame: Returns a DataFrame with the remaining rows
- utils.curate_data.freq_table(dset_df, column, min_freq=1)[source]¶
Generate a DataFrame tabulating the repeat frequencies of unique values.
Generate a DataFrame tabulating the repeat frequencies of each unique value in ‘column’. Restrict it to values occurring at least min_freq times.
- Args:
dset_df (DataFrame): An input DataFrame
column (str): The name of one column in DataFrame
min_freq (int): Restrict unique count to at least min_freq times.
- Returns:
- DataFrame: Dataframe containing two columns: the column passed in as the ‘column’ argument
and the column ‘Count’. The ‘Count’ column contains the number of occurrences for each value in the ‘column’ argument.
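A minimal sketch of the same tabulation on a list of dicts, using collections.Counter. The helper name freq_rows is hypothetical, and ordering the output by descending count is an assumption of this example.

```python
from collections import Counter


def freq_rows(rows, column, min_freq=1):
    """Return [{column: value, 'Count': n}, ...] for values seen at least min_freq times."""
    counts = Counter(row[column] for row in rows)
    return [
        {column: value, "Count": n}
        for value, n in counts.most_common()  # most frequent first
        if n >= min_freq
    ]


rows = [{"smiles": s} for s in ("CCO", "CCN", "CCO", "CCO", "CCC", "CCN")]
table = freq_rows(rows, "smiles", min_freq=2)
```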
- utils.curate_data.get_rdkit_smiles_parent(data)[source]¶
Strip the salts off the rdkit SMILES strings
First, loops through data and determines the base/parent SMILES string for each row. Appends the base SMILES string to a new row in a list. Then adds the list as a new column, ‘rdkit_smiles_parent’, in ‘data’. Basically calls base_smiles_from_smiles for each SMILES string in the column ‘rdkit_smiles’
- Args:
data (DataFrame): A DataFrame with a column named ‘rdkit_smiles’.
- Returns:
DataFrame with column ‘rdkit_smiles_parent’ with salts stripped
- utils.curate_data.labeled_freq_table(dset_df, columns, min_freq=1)[source]¶
Generate a frequency table in which additional columns are included.
Generate a frequency table in which additional columns are included. The first column in ‘columns’ is assumed to be a unique ID; there should be a many-to-1 mapping from the ID to each of the additional columns.
- Args:
dset_df (DataFrame): The input DataFrame.
- columns (list(str)): A list of columns to include in the output frequency table.
The first column in ‘columns’ is assumed to be a unique ID; there should be a many-to-1 mapping from the ID to each of the additional columns.
min_freq (int): Restrict unique count to at least min_freq times.
- Returns:
DataFrame: A DataFrame containing a frequency table.
- Raises:
- Exception: If the DataFrame violates the rule: there should be a many-to-1
mapping from the ID to each of the additional columns.
- utils.curate_data.mle_censored_mean(cmpd_df, std_est, value_col='PIC50', relation_col='relation')[source]¶
Computes maximum likelihood estimate of the true mean value for a single replicated compound.
Compute a maximum likelihood estimate of the true mean value underlying the distribution of replicate assay measurements for a single compound. The data may be a mix of censored and uncensored measurements, as indicated by the ‘relation’ column in the input DataFrame cmpd_df. std_est is an estimate for the standard deviation of the distribution, which is assumed to be Gaussian; we typically compute a common estimate for the whole dataset using replicate_rmsd().
- Args:
cmpd_df (DataFrame): DataFrame containing measurements and SMILES strings.
std_est (float): An estimate for the standard deviation of the distribution.
smiles_col (str): Name of the column that contains SMILES strings.
value_col (str): Name of the column that contains target values.
relation_col (str): The input DataFrame column containing relational operators (<, >, etc.).
- Returns:
float: Maximum likelihood estimate of the true mean for a replicated compound.
str: Relation: ‘’ if not censored, ‘>’ if right censored, ‘<’ if left censored.
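To make the censored likelihood concrete, here is a standalone sketch under stated assumptions. It is not the library's code. Uncensored values contribute a Gaussian density, ‘<’ values contribute the probability mass below the reported value, and ‘>’ values contribute the mass above it. The maximum is found with a simple ternary search, which is a choice made for this example and relies on the log-likelihood being concave in the mean. The names censored_log_likelihood and mle_mean are hypothetical, and the relation string that the real function also returns is omitted.

```python
import math


def _log_norm_cdf(z):
    """Log of the standard normal CDF, floored to avoid log(0)."""
    return math.log(max(0.5 * math.erfc(-z / math.sqrt(2.0)), 1e-300))


def censored_log_likelihood(mu, values, relations, std_est):
    """Gaussian log-likelihood of mu (up to a constant) with censoring."""
    total = 0.0
    for x, rel in zip(values, relations):
        z = (x - mu) / std_est
        if rel == "<":    # true value lies somewhere below x
            total += _log_norm_cdf(z)
        elif rel == ">":  # true value lies somewhere above x
            total += _log_norm_cdf(-z)
        else:             # uncensored measurement
            total += -0.5 * z * z
    return total


def mle_mean(values, relations, std_est, tol=1e-8):
    """Maximize the log-likelihood over mu by ternary search on a bounded interval."""
    lo = min(values) - 5.0 * std_est
    hi = max(values) + 5.0 * std_est
    while hi - lo > tol:
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if censored_log_likelihood(m1, values, relations, std_est) < \
                censored_log_likelihood(m2, values, relations, std_est):
            lo = m1
        else:
            hi = m2
    return 0.5 * (lo + hi)


plain = mle_mean([5.0, 6.0, 7.0], ["", "", ""], std_est=1.0)
shifted = mle_mean([5.0, 6.0], ["", ">"], std_est=1.0)
```

With no censoring the estimate reduces to the ordinary mean. A right-censored replicate pulls the estimate above the uncensored value. If every replicate is censored in the same direction the likelihood has no interior maximum, and this sketch simply returns a point near the edge of its search interval.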
- utils.curate_data.remove_outlier_replicates(df, response_col='pIC50', id_col='compound_id', max_diff_from_median=1.0)[source]¶
Examine groups of replicate measurements for compounds identified by compound ID and compute median response for each group. Eliminate measurements that differ by more than a given value from the median; note that in some groups this will result in all replicates being deleted. This function should be used together with aggregate_assay_data instead of average_and_remove_duplicates to reduce data to a single value per compound.
- Args:
df (DataFrame): Table of compounds and response data
response_col (str): Column containing response values
id_col (str): Column that uniquely identifies compounds, and therefore measurements to be treated as replicates.
max_diff_from_median (float): Maximum absolute difference from median value allowed for retained replicates.
- Returns:
result_df (DataFrame): Filtered data frame with outlier replicates removed.
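The median filter described above can be sketched on plain Python rows as follows. The helper name drop_outlier_replicates is hypothetical and the real function works on a DataFrame. The example also shows the documented case in which every replicate of a compound is removed.

```python
from collections import defaultdict
from statistics import median


def drop_outlier_replicates(rows, response_col="pIC50", id_col="compound_id",
                            max_diff_from_median=1.0):
    """Drop measurements farther than max_diff_from_median from their group median."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[id_col]].append(row[response_col])
    medians = {cid: median(vals) for cid, vals in groups.items()}
    return [
        row for row in rows
        if abs(row[response_col] - medians[row[id_col]]) <= max_diff_from_median
    ]


rows = [
    {"compound_id": "A", "pIC50": 6.0},
    {"compound_id": "A", "pIC50": 6.2},
    {"compound_id": "A", "pIC50": 9.0},  # far from the median of 6.2
    {"compound_id": "B", "pIC50": 4.0},  # median is 5.5, so both B rows go
    {"compound_id": "B", "pIC50": 7.0},
]
filtered = drop_outlier_replicates(rows)
```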
- utils.curate_data.replicate_rmsd(dset_df, smiles_col='base_rdkit_smiles', value_col='PIC50', relation_col='relation', default_val=1.0)[source]¶
Compute RMS deviation of all replicate uncensored measurements from means
Compute RMS deviation of all replicate uncensored measurements in dset_df from their means. Measurements are treated as replicates if they correspond to the same SMILES string, and are considered censored if the relation column contains > or <. The resulting value is meant to be used as an estimate of measurement error for all compounds in the dataset.
- Args:
dset_df (DataFrame): DataFrame containing uncensored measurements and SMILES strings.
smiles_col (str): Name of the column that contains SMILES strings.
value_col (str): Name of the column that contains target values.
relation_col (str): The input DataFrame column containing relational operators (<, >, etc.).
default_val (float): The value to return if there are no compounds with replicate measurements.
- Returns:
float: returns root mean squared deviation of all replicate uncensored measurements
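The calculation can be illustrated with a short standalone sketch. It assumes that deviations are pooled across all compounds with at least two uncensored measurements before the root mean square is taken. The helper name rmsd_of_replicates is hypothetical.

```python
import math
from collections import defaultdict


def rmsd_of_replicates(rows, smiles_col="base_rdkit_smiles", value_col="PIC50",
                       relation_col="relation", default_val=1.0):
    """RMS deviation of uncensored replicate values from their per-SMILES means."""
    groups = defaultdict(list)
    for row in rows:
        if row.get(relation_col) in ("<", ">"):
            continue  # censored measurements are ignored
        groups[row[smiles_col]].append(row[value_col])
    deviations = []
    for values in groups.values():
        if len(values) > 1:  # only compounds with replicates contribute
            mean = sum(values) / len(values)
            deviations.extend(v - mean for v in values)
    if not deviations:
        return default_val
    return math.sqrt(sum(d * d for d in deviations) / len(deviations))


rows = [
    {"base_rdkit_smiles": "CCO", "PIC50": 5.0, "relation": ""},
    {"base_rdkit_smiles": "CCO", "PIC50": 6.0, "relation": ""},
    {"base_rdkit_smiles": "CCO", "PIC50": 9.0, "relation": ">"},  # censored, skipped
    {"base_rdkit_smiles": "CCN", "PIC50": 7.0, "relation": ""},   # no replicate
]
rmsd = rmsd_of_replicates(rows)
```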
- utils.curate_data.set_group_permissions(path, system='AD', owner='GSK')[source]¶
Sets file and group permissions to standard values for a dataset containing proprietary data owned by ‘owner’. Later we may add a ‘public’ option, or groups for data from other pharma companies.
- Args:
path (string): File path
system (string): Computing environment from which group ownerships will be derived; currently, either ‘LC’ for LC filesystems or ‘AD’ for LLNL systems where owners and groups are managed by Active Directory.
owner (string): Who the data belongs to, either ‘public’ or the name of a company (e.g. ‘GSK’) associated with a restricted access group.
- Returns:
None
- utils.curate_data.summarize_data(column, num_bins, title, units, filepath, data, log_column='No')[source]¶
Summarizes the data in data[column]
Summarizes the data by printing mean, stdev, max, and min of the data. Creates plots of the binned values in data[column]. If log_column != ‘No’ this also creates plots that compare normal and log distributions of the data.
- Args:
column (str): Column of interest.
num_bins (int): Number of bins in the histogram.
title (str): Title of the histogram.
units (str): Units for values in ‘column’.
filepath (str): This file path gets printed to the console.
data (DataFrame): Input DataFrame.
- log_column (str): Defaults to ‘No’. Any other value will generate
a plot comparing normal and log distributions.
- Returns:
None
utils.data_curation_functions module¶
data_curation_functions.py
Extract Kevin’s functions for curation of public datasets. Modify them to match Jonathan’s curation methods in notebook 01/30/2020.
- utils.data_curation_functions.atom_curation(targ_lst, smiles_lst, shared_inchi_keys)[source]¶
Apply ATOM standard ‘curation’ step to “shared_df”: Average replicate assays, remove duplicates and drop cases with large variance between replicates. mleqonly
- Args:
targ_lst (list): A list of targets.
- smiles_lst (list): A list of DataFrames.
These DataFrames must contain the columns gene_names, standard_type, standard_relation, standard_inchi_key, PIC50, and rdkit_smiles
shared_inchi_keys (list): A list of inchi keys used in this dataset.
- Returns:
- list, list: A list of curated DataFrames and a list of the number of compounds
dropped during the curation process for each target.
- utils.data_curation_functions.atom_curation_excape(targ_lst, smiles_lst, shared_inchi_keys)[source]¶
Apply ATOM standard ‘curation’ step: Average replicate assays, remove duplicates and drop cases with large variance between replicates. Rows with NaN values in rdkit_smiles, VALUE_NUM_mean, and pXC50 are dropped
- Args:
targ_lst (list): A list of targets.
- smiles_lst (list): A list of DataFrames.
These DataFrames must contain the columns gene_names, standard_type, standard_relation, standard_inchi_key, pXC50, and rdkit_smiles
shared_inchi_keys (list): A list of inchi keys used in this dataset.
- Returns:
list: A list of curated DataFrames
- utils.data_curation_functions.compute_negative_log_responses(df, unit_col='unit', value_col='value', new_value_col='average_col', relation_col=None, new_relation_col=None, unit_conv={'nM': <function <lambda>>, 'uM': <function <lambda>>}, inplace=False)[source]¶
Given the response values in value_col (IC50, Ki, Kd, etc.), compute their negative base 10 logarithms (pIC50, pKi, pKd, etc.) after converting them to molar units and store them in new_value_col. If relation_col is provided, replace any ‘<’ or ‘>’ relations with their opposites and store the result in new_relation_col (if provided), or in relation_col if not. Rows where the original value is 0 or negative will be dropped from the dataset.
- Args:
df (DataFrame): A DataFrame that contains value_col, unit_col and relation_col.
unit_conv (dict): A dictionary mapping concentration units found in unit_col to functions that convert the corresponding concentrations to molar. The default handles micromolar and nanomolar units, represented as ‘uM’ and ‘nM’ respectively.
unit_col (str): Column containing units.
value_col (str): Column containing input values.
new_value_col (str): Column to receive converted values.
relation_col (str): Column containing relational operators for censored data.
new_relation_col (str): Column to receive inverted relations applicable to the negative log transformed values.
inplace (bool): If True, the input DataFrame is modified in place when possible. The default is to return a copy
- Returns:
DataFrame: A table containing the transformed values and relations.
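The transformation for a single measurement can be sketched as below. This is an illustration under stated assumptions, not the library's code. The names UNIT_TO_MOLAR, FLIP and negative_log_response are hypothetical, and the unit table mirrors the documented default of ‘uM’ and ‘nM’.

```python
import math

# Converters from the reported unit to molar, like the documented default unit_conv.
UNIT_TO_MOLAR = {
    "uM": lambda x: x * 1e-6,
    "nM": lambda x: x * 1e-9,
}
# Taking a negative log reverses the direction of an inequality.
FLIP = {"<": ">", ">": "<"}


def negative_log_response(value, unit, relation=None, unit_conv=UNIT_TO_MOLAR):
    """Return (p-value, flipped relation), or None when value is not positive."""
    if value <= 0:
        return None  # such rows are dropped, since the log is undefined
    molar = unit_conv[unit](value)
    return -math.log10(molar), FLIP.get(relation, relation)


p_value, relation = negative_log_response(250.0, "nM", "<")
```

An IC50 reported as ‘< 250 nM’ means the compound is at least that potent, so the corresponding pIC50 carries a ‘>’ relation.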
- utils.data_curation_functions.convert_IC50_to_pIC50(df, unit_col='unit', value_col='value', new_value_col='average_col', relation_col=None, new_relation_col=None, unit_conv={'nM': <function <lambda>>, 'uM': <function <lambda>>}, inplace=False)[source]¶
For backward compatibility only: equivalent to calling compute_negative_log_responses with the same arguments.
- utils.data_curation_functions.down_select(df, kv_lst)[source]¶
Filters rows given a set of values
Given a DataFrame and a list of (column, value) tuples, this function keeps only the rows where df[k] == v for every tuple (k, v).
- Args:
df (DataFrame): An input DataFrame.
kv_lst (list): A list of tuples of (column, value)
- Returns:
DataFrame: Rows where all df[k] == v
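Read with the Returns entry, the selection keeps a row only when every (column, value) pair matches. A dependency-free sketch on a list of dicts follows. The name down_select_rows is hypothetical.

```python
def down_select_rows(rows, kv_lst):
    """Keep rows where row[k] == v for every (k, v) pair in kv_lst."""
    return [row for row in rows if all(row[k] == v for k, v in kv_lst)]


rows = [
    {"gene_names": "JAK1", "standard_type": "IC50"},
    {"gene_names": "JAK1", "standard_type": "Ki"},
    {"gene_names": "JAK2", "standard_type": "IC50"},
]
selected = down_select_rows(rows, [("gene_names", "JAK1"), ("standard_type", "IC50")])
```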
- utils.data_curation_functions.exclude_organometallics(df, smiles_col='rdkit_smiles')[source]¶
Filters data frame df based on column smiles_col to exclude organometallic compounds
- utils.data_curation_functions.filter_dtc_data(orig_df, geneNames)[source]¶
Extracts and post processes JAK1, 2, and 3 datasets from DTC
This is specific to the DTC database. Extract JAK1, 2 and 3 datasets from Drug Target Commons database, filtered for data usability. filter criteria:
gene_names == JAK1 | JAK2 | JAK3
InChi key not missing
standard_type IC50
units NM
standard_relation mappable to =, < or >
wildtype_or_mutant != ‘mutated’
valid SMILES
maps to valid RDKit base SMILES
standard_value not missing
pIC50 > 3
- Args:
- orig_df (DataFrame): Input DataFrame. Must contain the following columns: gene_names,
standard_inchi_key, standard_type, standard_units, standard_value, compound_id, wildtype_or_mutant.
geneNames (list): A list of gene names to filter out of orig_df e.g. [‘JAK1’, ‘JAK2’].
- Returns:
DataFrame: The filtered rows of the orig_df
- utils.data_curation_functions.get_smiles_4dtc_data(nm_df, targ_lst, save_smiles_df)[source]¶
Returns SMILES strings from DTC data
nm_df must be a DataFrame from DTC with the following columns: gene_names, standard_type, standard_value, ‘standard_inchi_key’, and standard_relation.
This function selects all rows where nm_df[‘gene_names’] is in targ_lst, nm_df[‘standard_type’]==’IC50’, nm_df[‘standard_relation’]==’=’, and ‘standard_value’ > 0.
Then pIC50 values are calculated and added to the ‘PIC50’ column, and smiles strings are merged in from save_smiles_df
- Args:
nm_df (DataFrame): Input DataFrame.
targ_lst (list): A list of targets.
save_smiles_df (DataFrame): A DataFrame with the column ‘standard_inchi_key’
- Returns:
- list, list, str: A list of smiles. A list of inchi keys shared between targets.
And a description of the targets
- utils.data_curation_functions.get_smiles_dtc_data(nm_df, targ_lst, save_smiles_df)[source]¶
Returns SMILES strings from DTC data
nm_df must be a DataFrame from DTC with the following columns: gene_names, standard_type, standard_value, ‘standard_inchi_key’, and standard_relation.
This function selects all rows where nm_df[‘gene_names’] is in targ_lst, nm_df[‘standard_type’]==’IC50’, nm_df[‘standard_relation’]==’=’, and ‘standard_value’ > 0.
Then pIC50 values are calculated and added to the ‘PIC50’ column, and smiles strings are merged in from save_smiles_df
- Args:
nm_df (DataFrame): Input DataFrame.
targ_lst (list): A list of targets.
save_smiles_df (DataFrame): A DataFrame with the column ‘standard_inchi_key’
- Returns:
list, list: A list of smiles and a list of inchi keys shared between targets.
- utils.data_curation_functions.get_smiles_excape_data(nm_df, targ_lst)[source]¶
Calculate base rdkit smiles
Divides up nm_df based on target and makes one DataFrame for each target.
Rows with NaN pXC50 values are dropped. Base rdkit SMILES are calculated from the SMILES column using atomsci.ddm.utils.struct_utils.base_rdkit_smiles_from_smiles. A new column, ‘rdkit_smiles’, is added to each output DataFrame.
- Args:
- nm_df (DataFrame): DataFrame for Excape database. Should contain the columns,
pXC50, SMILES, and Ambit_InchiKey
targ_lst (list): A list of targets to filter out of nm_df
- Returns:
- list, list: A list of DataFrames, one for each target, and a list of
all inchi keys used in the dataset.
- utils.data_curation_functions.ic50topic50(x)[source]¶
Calculates pIC50 from IC50
- Args:
x (float): An IC50 in nanomolar (nM) units.
- Returns:
float: The pIC50.
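Because pIC50 is the negative base 10 logarithm of the IC50 in molar units, an input in nanomolar reduces to 9 - log10(IC50). A minimal sketch of that arithmetic follows. The name ic50_nm_to_pic50 is hypothetical.

```python
import math


def ic50_nm_to_pic50(ic50_nm):
    """pIC50 = -log10(IC50 in M) = 9 - log10(IC50 in nM)."""
    return 9.0 - math.log10(ic50_nm)


micromolar_hit = ic50_nm_to_pic50(1000.0)  # 1000 nM is 1 uM
```

Each tenfold drop in IC50 raises the pIC50 by one unit, so a 1 uM compound sits at 6 and a 1 nM compound at 9.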
- utils.data_curation_functions.is_organometallic(mol)[source]¶
Returns True if the molecule is organometallic
- utils.data_curation_functions.set_data_root(dir)[source]¶
Set global variables for data directories
Creates paths for DTC and Excape given a root data directory. Sets the global variables ‘data_root’ and ‘data_dirs’. ‘data_root’ is the root data directory. ‘data_dirs’ is a dictionary that maps ‘DTC’ and ‘Excape’ to directories calculated from ‘data_root’
- Args:
dir (str): root data directory containing folders ‘dtc’ and ‘excape’
- Returns:
None
- utils.data_curation_functions.standardize_relations(dset_df, db=None, rel_col=None, output_rel_col=None, invert=False)[source]¶
Standardizes censoring operators
Standardize the censoring operators to =, < or >, and remove any rows whose operators don’t map to a standard one. There is a special case for db=’ChEMBL’ that strips the extra single quotes (') around relationship symbols. Assumes relationship columns are ‘Standard Relation’, ‘standard_relation’ and ‘activity_prefix’ for ChEMBL, DTC and GoStar respectively.
This function makes the following mappings: “>” to “>”, “>=” to “>”, “<” to “<”, “<=” to “<”, and “=” to “=”. All other relations are removed from the DataFrame.
- Args:
- dset_df (DataFrame): Input DataFrame. Must contain either ‘Standard Relation’
or ‘standard_relation’
db (str): Source database. Must be either ‘GoStar’, ‘DTC’ or ‘ChEMBL’. Required if rel_col is not specified.
- rel_col (str): Column containing relational operators. If specified, overrides the default relation column
for db.
- output_rel_col (str): If specified, put the standardized operators in a new column with this name and leave
the original operator column unchanged.
- invert (bool): If true, replace the inequality operators with their inverses. This is useful when a reported
value such as IC50 is converted to its negative log such as pIC50.
- Returns:
DataFrame: Dataframe with the standardized relationship symbols
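The operator mapping listed above, together with the invert option, can be sketched for a single operator string. This is an illustrative stand-in, not the library's code. The names STANDARD, INVERSE and standardize_relation are hypothetical, and stripping surrounding single quotes from every input is a simplification of the ChEMBL special case.

```python
# Documented mapping: >= collapses to >, <= collapses to <, = is kept.
STANDARD = {">": ">", ">=": ">", "<": "<", "<=": "<", "=": "="}
INVERSE = {">": "<", "<": ">", "=": "="}


def standardize_relation(raw, invert=False):
    """Return '=', '<' or '>' for a raw operator, or None if it cannot be mapped."""
    if raw is None:
        return None
    op = STANDARD.get(raw.strip().strip("'"))  # tolerate quoted operators such as "'>='"
    if op is not None and invert:
        op = INVERSE[op]
    return op


raw_ops = ["'>='", "<=", "=", "~"]
standardized = [standardize_relation(op) for op in raw_ops]
inverted = [standardize_relation(op, invert=True) for op in raw_ops]
```

Rows whose operator maps to None correspond to the rows that the documented function removes from the DataFrame.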
- utils.data_curation_functions.upload_df_dtc_base_smiles_all(dset_name, title, description, tags, functional_area, target, target_type, activity, assay_category, data_df, dtc_mleqonly_fileID, data_origin='journal', species='human', force_update=False)[source]¶
Uploads DTC base smiles data to the datastore
Uploads base SMILES string for the DTC dataset.
Returns the datastore OID of the uploaded dataset. The dataset is uploaded to the public bucket and lists https://doi.org/10.1016/j.chembiol.2017.11.009 as the DOI. This also assumes that the id_col is ‘compound_id’, the response column is set to PIC50, and the SMILES are assumed to be in ‘base_rdkit_smiles’.
- Args:
dset_name (str): Name of the dataset. Should not include a file extension.
title (str): Title of the file (in human-friendly format)
description (str): long text box to describe file (background/use notes)
tags (list): Must be a list of strings.
functional_area (str): The functional area.
target (str): The target.
target_type (str): The target type of the dataset.
activity (str): The activity of the dataset.
assay_category (str): The assay category of the dataset.
data_df (DataFrame): DataFrame to be uploaded.
dtc_mleqonly_fileID (str): Source file id used to generate data_df.
data_origin (str): The origin of the dataset e.g. journal.
species (str): The species of the dataset e.g. human, rat, dog.
force_update (bool): Overwrite existing datasets in the datastore.
- Returns:
str: datastore OID of the uploaded dataset.
- utils.data_curation_functions.upload_df_dtc_mleqonly(dset_name, title, description, tags, functional_area, target, target_type, activity, assay_category, data_df, dtc_smiles_fileID, data_origin='journal', species='human', force_update=False)[source]¶
Uploads DTC mleqonly data to the datastore
Upload mleqonly data to the datastore from the given DataFrame. The DataFrame must contain the columns ‘rdkit_smiles’ and ‘VALUE_NUM_mean’. This function is meant to upload data that has been aggregated using atomsci.ddm.utils.curate_data.average_and_remove_duplicates. Returns the datastore OID of the uploaded dataset. The dataset is uploaded to the public bucket and lists https://doi.org/10.1016/j.chembiol.2017.11.009 as the DOI. This also assumes that the id_col is ‘compound_id’.
- Args:
dset_name (str): Name of the dataset. Should not include a file extension.
title (str): Title of the file (in human-friendly format)
description (str): long text box to describe file (background/use notes)
tags (list): Must be a list of strings.
functional_area (str): The functional area.
target (str): The target.
target_type (str): The target type of the dataset.
activity (str): The activity of the dataset.
assay_category (str): The assay category of the dataset.
data_df (DataFrame): DataFrame to be uploaded.
dtc_smiles_fileID (str): Source file id used to generate data_df.
data_origin (str): The origin of the dataset e.g. journal.
species (str): The species of the dataset e.g. human, rat, dog.
force_update (bool): Overwrite existing datasets in the datastore.
- Returns:
str: datastore OID of the uploaded dataset.
- utils.data_curation_functions.upload_df_dtc_mleqonly_class(dset_name, title, description, tags, functional_area, target, target_type, activity, assay_category, data_df, dtc_mleqonly_fileID, data_origin='journal', species='human', force_update=False)[source]¶
Uploads DTC mleqonly classification data to the datastore
Upload mleqonly classification data to the datastore from the given DataFrame. The DataFrame must contain the columns ‘rdkit_smiles’ and ‘binary_class’. This function is meant to upload data that has been aggregated using atomsci.ddm.utils.curate_data.average_and_remove_duplicates and then thresholded to make a binary classification dataset. Returns the datastore OID of the uploaded dataset. The dataset is uploaded to the public bucket and lists https://doi.org/10.1016/j.chembiol.2017.11.009 as the DOI. This also assumes that the id_col is ‘compound_id’.
- Args:
dset_name (str): Name of the dataset. Should not include a file extension.
title (str): Title of the file (in human-friendly format)
description (str): long text box to describe file (background/use notes)
tags (list): Must be a list of strings.
functional_area (str): The functional area.
target (str): The target.
target_type (str): The target type of the dataset.
activity (str): The activity of the dataset.
assay_category (str): The assay category of the dataset.
data_df (DataFrame): DataFrame to be uploaded.
dtc_mleqonly_fileID (str): Source file id used to generate data_df.
data_origin (str): The origin of the dataset e.g. journal.
species (str): The species of the dataset e.g. human, rat, dog.
force_update (bool): Overwrite existing datasets in the datastore.
- Returns:
str: datastore OID of the uploaded dataset.
- utils.data_curation_functions.upload_df_dtc_smiles(dset_name, title, description, tags, functional_area, target, target_type, activity, assay_category, smiles_df, orig_fileID, data_origin='journal', species='human', force_update=False)[source]¶
Uploads DTC smiles data to the datastore
Upload a raw dataset to the datastore from the given DataFrame. Returns the datastore OID of the uploaded dataset. The dataset is uploaded to the public bucket and lists https://doi.org/10.1016/j.chembiol.2017.11.009 as the DOI. This also assumes that the id_col is ‘compound_id’.
- Args:
dset_name (str): Name of the dataset. Should not include a file extension.
title (str): Title of the file (in human-friendly format)
description (str): long text box to describe file (background/use notes)
tags (list): Must be a list of strings.
functional_area (str): The functional area.
target (str): The target.
target_type (str): The target type of the dataset.
activity (str): The activity of the dataset.
assay_category (str): The assay category of the dataset.
smiles_df (DataFrame): DataFrame containing SMILES to be uploaded.
orig_fileID (str): Source file id used to generate smiles_df.
data_origin (str): The origin of the dataset e.g. journal.
species (str): The species of the dataset e.g. human, rat, dog.
force_update (bool): Overwrite existing datasets in the datastore.
- Returns:
str: datastore OID of the uploaded dataset.
- utils.data_curation_functions.upload_df_dtc_smiles_regr_all_class(dset_name, title, description, tags, functional_area, target, target_type, activity, assay_category, data_df, dtc_smiles_regr_all_fileID, smiles_column, data_origin='journal', species='human', force_update=False)[source]¶
Uploads DTC classification data to the datastore
Uploads binary classification data for the DTC dataset. Class names are assumed to be ‘active’ and ‘inactive’.
Returns the datastore OID of the uploaded dataset. The dataset is uploaded to the public bucket and lists https://doi.org/10.1016/j.chembiol.2017.11.009 as the DOI. This also assumes that the id_col is ‘compound_id’ and that the response column is set to PIC50.
- Args:
dset_name (str): Name of the dataset. Should not include a file extension.
title (str): Title of the file (in human-friendly format)
description (str): long text box to describe file (background/use notes)
tags (list): Must be a list of strings.
functional_area (str): The functional area.
target (str): The target.
target_type (str): The target type of the dataset.
activity (str): The activity of the dataset.
assay_category (str): The assay category of the dataset.
data_df (DataFrame): DataFrame to be uploaded.
dtc_smiles_regr_all_fileID (str): Source file id used to generate data_df.
smiles_column (str): Column containing SMILES.
data_origin (str): The origin of the dataset e.g. journal.
species (str): The species of the dataset e.g. human, rat, dog.
force_update (bool): Overwrite existing datasets in the datastore.
- Returns:
str: datastore OID of the uploaded dataset.
- utils.data_curation_functions.upload_df_excape_mleqonly(dset_name, title, description, tags, functional_area, target, target_type, activity, assay_category, data_df, smiles_fileID, data_origin='journal', species='human', force_update=False)[source]¶
Uploads Excape mleqonly data to the datastore
Upload mleqonly to the datastore from the given DataFrame. Returns the datastore OID of the uploaded dataset. The dataset is uploaded to the public bucket and lists https://dx.doi.org/10.1186%2Fs13321-017-0203-5 as the doi. This also assumes that the id_col is ‘Original_Entry_ID’, smiles_col is ‘rdkit_smiles’ and response_col is ‘VALUE_NUM_mean’.
- Args:
dset_name (str): Name of the dataset. Should not include a file extension.
title (str): Title of the file in a human-friendly format.
description (str): Long text describing the file (background/use notes).
tags (list): Must be a list of strings.
functional_area (str): The functional area.
target (str): The target.
target_type (str): The target type of the dataset.
activity (str): The activity of the dataset.
assay_category (str): The assay category of the dataset.
data_df (DataFrame): DataFrame containing SMILES to be uploaded.
smiles_fileID (str): Source file id used to generate data_df.
data_origin (str): The origin of the dataset e.g. journal.
species (str): The species of the dataset e.g. human, rat, dog.
force_update (bool): Overwrite existing datasets in the datastore.
- Returns:
str: datastore OID of the uploaded dataset.
- utils.data_curation_functions.upload_df_excape_mleqonly_class(dset_name, title, description, tags, functional_area, target, target_type, activity, assay_category, data_df, mleqonly_fileID, data_origin='journal', species='human', force_update=False)[source]¶
Uploads Excape mleqonly classification data to the datastore
data_df contains a binary classification dataset with ‘active’ and ‘inactive’ classes.
Upload mleqonly classification to the datastore from the given DataFrame. Returns the datastore OID of the uploaded dataset. The dataset is uploaded to the public bucket and lists https://dx.doi.org/10.1186%2Fs13321-017-0203-5 as the doi. This also assumes that the id_col is ‘Original_Entry_ID’, smiles_col is ‘rdkit_smiles’ and response_col is ‘binary_class’.
- Args:
dset_name (str): Name of the dataset. Should not include a file extension.
title (str): Title of the file in a human-friendly format.
description (str): Long text describing the file (background/use notes).
tags (list): Must be a list of strings.
functional_area (str): The functional area.
target (str): The target.
target_type (str): The target type of the dataset.
activity (str): The activity of the dataset.
assay_category (str): The assay category of the dataset.
data_df (DataFrame): DataFrame containing SMILES to be uploaded.
mleqonly_fileID (str): Source file id used to generate data_df.
data_origin (str): The origin of the dataset e.g. journal.
species (str): The species of the dataset e.g. human, rat, dog.
force_update (bool): Overwrite existing datasets in the datastore.
- Returns:
str: datastore OID of the uploaded dataset.
- utils.data_curation_functions.upload_df_excape_smiles(dset_name, title, description, tags, functional_area, target, target_type, activity, assay_category, smiles_df, orig_fileID, data_origin='journal', species='human', force_update=False)[source]¶
Uploads Excape SMILES data to the datastore
Upload SMILES to the datastore from the given DataFrame. Returns the datastore OID of the uploaded dataset. The dataset is uploaded to the public bucket and lists https://dx.doi.org/10.1186%2Fs13321-017-0203-5 as the doi. This also assumes that the id_col is ‘Original_Entry_ID’
- Args:
dset_name (str): Name of the dataset. Should not include a file extension.
title (str): Title of the file in a human-friendly format.
description (str): Long text describing the file (background/use notes).
tags (list): Must be a list of strings.
functional_area (str): The functional area.
target (str): The target.
target_type (str): The target type of the dataset.
activity (str): The activity of the dataset.
assay_category (str): The assay category of the dataset.
smiles_df (DataFrame): DataFrame containing SMILES to be uploaded.
orig_fileID (str): Source file id used to generate smiles_df.
data_origin (str): The origin of the dataset e.g. journal.
species (str): The species of the dataset e.g. human, rat, dog.
force_update (bool): Overwrite existing datasets in the datastore.
- Returns:
str: datastore OID of the uploaded dataset.
- utils.data_curation_functions.upload_file_dtc_raw_data(dset_name, title, description, tags, functional_area, target, target_type, activity, assay_category, file_path, data_origin='journal', species='human', force_update=False)[source]¶
Uploads raw DTC data to the datastore
Upload a raw dataset to the datastore from the given file. Returns the datastore OID of the uploaded dataset. The dataset is uploaded to the public bucket and lists https://doi.org/10.1016/j.chembiol.2017.11.009 as the doi. This also assumes that the id_col is ‘compound_id’.
- Args:
dset_name (str): Name of the dataset. Should not include a file extension.
title (str): Title of the file in a human-friendly format.
description (str): Long text describing the file (background/use notes).
tags (list): Must be a list of strings.
functional_area (str): The functional area.
target (str): The target.
target_type (str): The target type of the dataset.
activity (str): The activity of the dataset.
assay_category (str): The assay category of the dataset.
file_path (str): The filepath of the dataset.
data_origin (str): The origin of the dataset e.g. journal.
species (str): The species of the dataset e.g. human, rat, dog.
force_update (bool): Overwrite existing datasets in the datastore.
- Returns:
str: datastore OID of the uploaded dataset.
- utils.data_curation_functions.upload_file_dtc_smiles_regr_all(dset_name, title, description, tags, functional_area, target, target_type, activity, assay_category, file_path, dtc_smiles_fileID, smiles_column, data_origin='journal', species='human', force_update=False)[source]¶
Uploads regression DTC data to the datastore
Uploads regression dataset for DTC dataset.
Returns the datastore OID of the uploaded dataset. The dataset is uploaded to the public bucket and lists https://doi.org/10.1016/j.chembiol.2017.11.009 as the doi. This also assumes that the id_col is ‘compound_id’ and that the response column is set to PIC50.
- Args:
dset_name (str): Name of the dataset. Should not include a file extension.
title (str): Title of the file in a human-friendly format.
description (str): Long text describing the file (background/use notes).
tags (list): Must be a list of strings.
functional_area (str): The functional area.
target (str): The target.
target_type (str): The target type of the dataset.
activity (str): The activity of the dataset.
assay_category (str): The assay category of the dataset.
file_path (str): The filepath of the dataset.
dtc_smiles_fileID (str): Source file id used to generate the file at file_path.
smiles_column (str): Column containing SMILES.
data_origin (str): The origin of the dataset e.g. journal.
species (str): The species of the dataset e.g. human, rat, dog.
force_update (bool): Overwrite existing datasets in the datastore.
- Returns:
str: datastore OID of the uploaded dataset.
- utils.data_curation_functions.upload_file_excape_raw_data(dset_name, title, description, tags, functional_area, target, target_type, activity, assay_category, file_path, data_origin='journal', species='human', force_update=False)[source]¶
Uploads raw Excape data to the datastore
Upload a raw dataset to the datastore from the given file. Returns the datastore OID of the uploaded dataset. The dataset is uploaded to the public bucket and lists https://dx.doi.org/10.1186%2Fs13321-017-0203-5 as the doi. This also assumes that the id_col is ‘Original_Entry_ID’.
- Args:
dset_name (str): Name of the dataset. Should not include a file extension.
title (str): title of the file in (human friendly format)
description (str): long text box to describe file (background/use notes)
tags (list): Must be a list of strings.
functional_area (str): The functional area.
target (str): The target.
target_type (str): The target type of the dataset.
activity (str): The activity of the dataset.
assay_category (str): The assay category of the dataset.
file_path (str): The filepath of the dataset.
data_origin (str): The origin of the dataset e.g. journal.
species (str): The species of the dataset e.g. human, rat, dog.
force_update (bool): Overwrite existing datasets in the datastore.
- Returns:
str: datastore OID of the uploaded dataset.
utils.datastore_functions module¶
utils.hyperparam_search_wrapper module¶
utils.many_to_one module¶
- utils.many_to_one.many_to_one_df(df, smiles_col, id_col)[source]¶
AMPL requires that SMILES and compound_ids have a many-to-one mapping. This function checks the given DataFrame against this constraint. It also checks whether any SMILES or compound_ids are empty/NaN.
- Arguments:
df (pd.DataFrame): The DataFrame in question.
smiles_col (str): The column containing SMILES.
id_col (str): The column containing compound ids.
- Returns:
- True if there is a many-to-one mapping. Raises one of three errors if the DataFrame:
- Has NaN compound_ids
- Has NaN SMILES
- Is not a many-to-one mapping between compound_ids and SMILES
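The three checks can be illustrated without pandas. The sketch below is not AMPL's implementation. It assumes the constraint means that each compound id maps to exactly one SMILES string, while several ids may share a SMILES. The helper names `is_missing` and `check_many_to_one`, the list-of-dicts input and the use of ValueError are illustrative assumptions.

```python
import math


def is_missing(value):
    """Treat None, empty strings and float NaN as missing (assumed rule)."""
    if value is None or value == "":
        return True
    return isinstance(value, float) and math.isnan(value)


def check_many_to_one(rows, smiles_col, id_col):
    """Return True if every compound id maps to exactly one SMILES string.

    Raises ValueError for a missing id, a missing SMILES, or an id that
    appears with more than one distinct SMILES.
    """
    seen = {}  # compound id -> first SMILES seen for that id
    for row in rows:
        cid, smiles = row[id_col], row[smiles_col]
        if is_missing(cid):
            raise ValueError("missing compound id")
        if is_missing(smiles):
            raise ValueError("missing SMILES")
        if seen.setdefault(cid, smiles) != smiles:
            raise ValueError(f"id {cid!r} maps to more than one SMILES")
    return True


# Hypothetical rows standing in for a DataFrame.
rows = [
    {"compound_id": "c1", "smiles": "CCO"},
    {"compound_id": "c1", "smiles": "CCO"},  # replicate with the same structure
    {"compound_id": "c2", "smiles": "CCO"},  # a second id may share a SMILES
]
print(check_many_to_one(rows, "smiles", "compound_id"))
```

Adding a row that pairs ‘c1’ with a different SMILES would make the sketch raise instead of returning True.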
utils.model_file_reader module¶
- class utils.model_file_reader.ModelFileReader(data_file_path)[source]¶
Bases:
object
A class that encapsulates the model metadata you might want to read from a model file or folder, such as the version number, the dataset key and the split uuid.
- Attributes:
- Set in __init__:
data_file_path (str): a model data file or a directory that contains the model
- utils.model_file_reader.get_multiple_models_metadata(*args)[source]¶
A function that takes model tar.gz file(s) and extracts the metadata (and, if applicable, model metrics).
- Args:
*args: Variable length argument list of model tar.gz file(s)
- Returns:
A list of the most important parameters and metrics for each model, or an empty list if the input file(s) cannot be parsed.
- Exception:
IOError: Raised if a file cannot be accessed or cannot be parsed as an AMPL model.
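As a rough illustration of reading metadata out of a tar.gz archive, the sketch below uses only the standard library. It is not the module's code. It assumes the archive holds a JSON file named model_metadata.json; the function name `read_metadata_from_tar` and the `model_uuid` key in the demo payload are hypothetical.

```python
import io
import json
import os
import tarfile
import tempfile


def read_metadata_from_tar(tar_path, member_name="model_metadata.json"):
    """Return the parsed JSON metadata stored in a tar.gz archive.

    Raises IOError if the archive cannot be read or has no such member.
    """
    try:
        with tarfile.open(tar_path, "r:gz") as tar:
            for member in tar.getmembers():
                # Match on the file name, ignoring any directory prefix.
                if member.name.split("/")[-1] == member_name:
                    with tar.extractfile(member) as fh:
                        return json.load(fh)
    except (tarfile.TarError, OSError) as exc:
        raise IOError(f"cannot read {tar_path}: {exc}") from exc
    raise IOError(f"{member_name} not found in {tar_path}")


# Build a tiny archive in a temporary directory, then read it back.
with tempfile.TemporaryDirectory() as tmp:
    tar_path = os.path.join(tmp, "model.tar.gz")
    payload = json.dumps({"model_uuid": "example-uuid"}).encode()
    info = tarfile.TarInfo("model_metadata.json")
    info.size = len(payload)
    with tarfile.open(tar_path, "w:gz") as tar:
        tar.addfile(info, io.BytesIO(payload))
    metadata = read_metadata_from_tar(tar_path)

print(metadata["model_uuid"])
```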
utils.model_retrain module¶
- utils.model_retrain.train_model(input, output, dskey='', production=False)[source]¶
Retrain a model saved in a model_metadata.json file
- Args:
input (str): path to model_metadata.json file
output (str): path to output directory
dskey (str): new dataset key if file location has changed
production (bool): retrain the model using production mode
- Returns:
None
- utils.model_retrain.train_model_from_tar(input, output, dskey='', production=False)[source]¶
Retrain a model saved in a tar.gz file
- Args:
input (str): path to a tar.gz file
output (str): path to output directory
dskey (str): new dataset key if file location has changed
production (bool): retrain the model using production mode
- Returns:
None
- utils.model_retrain.train_model_from_tracker(model_uuid, output_dir, production=False)[source]¶
Retrain a model saved in the model tracker, but save it to output_dir and don’t insert it into the model tracker
- Args:
model_uuid (str): model_uuid of the model in the model tracker
output_dir (str): path to output directory
production (bool): retrain the model using production mode
- Returns:
the model pipeline object with trained model
- utils.model_retrain.train_models_from_dataset_keys(input, output, pred_type='regression', production=False)[source]¶
Retrain a list of models from an input file
- Args:
input (str): path to an Excel or csv file. The required columns are ‘dataset_key’ and ‘bucket’ (public, private_file or Filesystem).
output (str): path to output directory
pred_type (str, optional): the model prediction type. Defaults to ‘regression’.
- Returns:
None
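Before calling train_models_from_dataset_keys it can help to confirm that a csv input has the two required columns. The following is a minimal standard-library sketch, not part of AMPL. The function name `read_dataset_keys` and the example dataset keys are made up for illustration.

```python
import csv
import io

REQUIRED_COLUMNS = {"dataset_key", "bucket"}


def read_dataset_keys(csv_file):
    """Return the rows of a csv input after checking the required columns."""
    reader = csv.DictReader(csv_file)
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing required columns: {sorted(missing)}")
    return list(reader)


# Hypothetical file contents held in memory instead of on disk.
example = io.StringIO(
    "dataset_key,bucket\n"
    "/data/target_a.csv,Filesystem\n"
    "target_b.csv,public\n"
)
rows = read_dataset_keys(example)
print([row["bucket"] for row in rows])
```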
utils.model_version_utils module¶
model_version_utils.py
Misc utilities to get the AMPL version(s) used to train one or more models and check them for compatibility with the currently running version of AMPL.
To check the model version:
usage: model_version_utils.py [-h] -i INPUT
- optional arguments:
- -h, --help
show this help message and exit
- -i INPUT, --input INPUT
input directory/file (required)
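The usage line above corresponds to a parser with a single required option. The sketch below shows one way such a parser could be declared with argparse. It is an illustration of the documented interface, not the script's actual source, and the function name `build_parser` and the path ‘models/’ are assumptions.

```python
import argparse


def build_parser():
    """Build a parser matching: model_version_utils.py [-h] -i INPUT."""
    parser = argparse.ArgumentParser(prog="model_version_utils.py")
    parser.add_argument(
        "-i", "--input", required=True,
        help="input directory/file (required)",
    )
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(["-i", "models/"])
print(args.input)
```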
- utils.model_version_utils.check_version_compatible(input, ignore_check=False)[source]¶
Compare the input file’s version against the running AMPL version to see if they are compatible
- Args:
input (str): file or version number
- Returns:
True if the input model version matches the compatible AMPL version group
- utils.model_version_utils.get_ampl_version()[source]¶
Get the running ampl version
- Returns:
the AMPL version
- utils.model_version_utils.get_ampl_version_from_dir(dirname)[source]¶
Get the AMPL versions for all the models stored under the given directory and its subdirectories, recursively.
- Args:
dirname (str): directory
- Returns:
list of AMPL versions
- utils.model_version_utils.get_ampl_version_from_json(metadata_path)[source]¶
Parse model_metadata.json to get the AMPL version
- Args:
metadata_path (str): path to model_metadata.json file
- Returns:
the AMPL version number
utils.pubchem_utils module¶
- utils.pubchem_utils.download_SID_from_bioactivity_assay(bioassayid)[source]¶
Retrieve summary info on bioactivity assays.
- Args:
bioassayid: a single PubChem AID (bioactivity assay id)
- Returns:
The sids tested on this assay
- utils.pubchem_utils.download_activitytype(aid, sid)[source]¶
Retrieve data for assays for a select list of sids.
- Args:
aid: a bioactivity id (aid)
sid (list): list of sids specified as integers
- Returns:
Nothing returned yet, will return basic stats to help decide whether to use assay or not
- utils.pubchem_utils.download_bioactivity_assay(myList, intv=1)[source]¶
Retrieve summary info on bioactivity assays.
- Args:
myList (list): List of PubChem AIDs (bioactivity assay ids)
intv (int): number of INCHIKEYS to submit queries for in one request, default is 1
- Returns:
Nothing returned yet, will return basic stats to help decide whether to use assay or not
- utils.pubchem_utils.download_dose_response_from_bioactivity(aid, sidlst)[source]¶
Retrieve data for assays for a select list of sids.
- Args:
aid: a bioactivity id (aid)
sidlst (list): list of sids specified as integers
- Returns:
Nothing returned yet, will return basic stats to help decide whether to use assay or not
- utils.pubchem_utils.download_smiles(myList, intv=1)[source]¶
Retrieve canonical SMILES strings for a list of input INCHIKEYS. Will return only one SMILES string per INCHIKEY. If there are multiple values returned, the first is retained and the others are returned in the discard_lst. INCHIKEYS that fail to return a SMILES string are put in the fail_lst.
- Args:
myList (list): List of INCHIKEYS
intv (int): number of INCHIKEYS to submit queries for in one request, default is 1
- Returns:
list of SMILES strings corresponding to INCHIKEYS
list of INCHIKEYS, which failed to return a SMILES string
list of CIDs and SMILES, which were returned beyond the first CID and SMILES found for input INCHIKEY
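The batching by intv and the three returned lists can be sketched without any network access. The code below is only an illustration of the bookkeeping described above and does not query PubChem. The helper names `chunked` and `keep_first_hit`, the placeholder keys and the lookup dictionary are hypothetical.

```python
def chunked(items, size):
    """Yield successive slices of items holding at most size elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def keep_first_hit(hits_by_key, keys):
    """Split lookup results into kept, failed and discarded lists.

    hits_by_key maps each key to a list of (cid, smiles) hits.
    """
    smiles_lst, fail_lst, discard_lst = [], [], []
    for key in keys:
        hits = hits_by_key.get(key, [])
        if not hits:
            fail_lst.append(key)          # no SMILES returned for this key
            continue
        smiles_lst.append(hits[0][1])     # retain the first SMILES
        discard_lst.extend(hits[1:])      # record every later (cid, smiles)
    return smiles_lst, fail_lst, discard_lst


# Placeholder keys and results standing in for real query responses.
keys = ["KEY-A", "KEY-B", "KEY-C"]
hits = {"KEY-A": [(1, "CCO"), (2, "OCC")], "KEY-B": [(3, "c1ccccc1")]}

batches = list(chunked(keys, 2))
result = keep_first_hit(hits, keys)
print(batches)
print(result)
```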
utils.rdkit_easy module¶
utils.split_response_dist_plots module¶
Module to plot distributions of response values in each subset of a dataset generated by a split
- utils.split_response_dist_plots.get_split_labeled_dataset(params)[source]¶
Add a column to a dataset labeling the split subset for each row. Given a dataset and split parameters (including split_uuid) referenced in params, returns a data frame containing the dataset with an extra ‘split_subset’ column indicating the subset each data point belongs to. For standard 3-way splits, the labels will be ‘train’, ‘valid’ and ‘test’. For a k-fold CV split, the labels will be ‘fold_0’ through ‘fold_<k-1>’ and ‘test’.
- Args:
params (argparse.Namespace or dict): Structure containing dataset and split parameters. The following parameters are required, if not set to default values:
- dataset_key
- split_uuid
- split_strategy
- splitter
- split_valid_frac
- split_test_frac
- num_folds
- smiles_col
- response_cols
- Returns:
A tuple (dset_df, split_label):
- dset_df (DataFrame): The dataset specified by params.dataset_key, with additional column split_subset.
- split_label (str): A short description of the split, useful for plot labeling.
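The subset labels described for get_split_labeled_dataset follow a simple naming rule, sketched below. This is an illustration of the labels only, not AMPL code, and the function name `split_subset_labels` is hypothetical.

```python
def split_subset_labels(num_folds=None):
    """Return the subset labels for a 3-way split or a k-fold CV split."""
    if num_folds is None:
        # Standard 3-way split.
        return ["train", "valid", "test"]
    # k-fold CV: one label per fold plus the held-out test set.
    return [f"fold_{i}" for i in range(num_folds)] + ["test"]


print(split_subset_labels())
print(split_subset_labels(num_folds=3))
```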
- utils.split_response_dist_plots.plot_split_subset_response_distrs(params)[source]¶
Plot the distributions of the response variable(s) in each split subset of a dataset.
- Args:
params (argparse.Namespace or dict): Structure containing dataset and split parameters. The following parameters are required, if not set to default values:
- dataset_key
- split_uuid
- split_strategy
- splitter
- split_valid_frac
- split_test_frac
- num_folds
- smiles_col
- response_cols
- Returns:
None
utils.struct_utils module¶
Functions to manipulate and convert between various representations of chemical structures: SMILES, InChI and RDKit Mol objects. Many of these functions (those with a ‘workers’ argument) accept either a single SMILES or InChI string or a list of strings as their first argument, and return a value with the same datatype. If a list is passed and the ‘workers’ argument is > 1, the calculation is parallelized across multiple threads; this can save significant time when operating on thousands of molecules.
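The "single string or list, optionally parallel" calling convention can be sketched as follows. This is a generic illustration using a thread pool from the standard library, not the module's implementation. The name `apply_to_one_or_many` is hypothetical, and `str.strip` stands in for a real per-molecule conversion.

```python
from concurrent.futures import ThreadPoolExecutor


def apply_to_one_or_many(func, value, workers=1):
    """Apply func to a string or a list of strings, returning the same shape."""
    if isinstance(value, str):
        return func(value)
    if workers > 1:
        with ThreadPoolExecutor(max_workers=workers) as pool:
            # Executor.map returns results in input order.
            return list(pool.map(func, value))
    return [func(item) for item in value]


single = apply_to_one_or_many(str.strip, " CCO ")
many = apply_to_one_or_many(str.strip, [" CCO ", "CCN "], workers=2)
print(single)
print(many)
```

A string in gives a string out, and a list in gives a list of the same length and order, whatever the value of workers.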
- utils.struct_utils.base_mol_from_inchi(inchi_str, useIsomericSmiles=True, removeCharges=False)[source]¶
Generate a standardized RDKit Mol object for the largest fragment of the molecule specified by InChI string inchi_str. Replace any rare isotopes with the most common ones for each element. If removeCharges is True, add hydrogens as needed to eliminate charges.
- Args:
inchi_str (str): InChI string representing molecule.
useIsomericSmiles (bool): Whether to retain stereochemistry information in the generated string.
removeCharges (bool): If true, add or remove hydrogens to produce uncharged molecules.
- Returns:
rdkit.Chem.Mol: Standardized salt-stripped RDKit Mol object.
- utils.struct_utils.base_mol_from_smiles(orig_smiles, useIsomericSmiles=True, removeCharges=False)[source]¶
Generate a standardized RDKit Mol object for the largest fragment of the molecule specified by orig_smiles. Replace any rare isotopes with the most common ones for each element. If removeCharges is True, add hydrogens as needed to eliminate charges.
- Args:
orig_smiles (str): SMILES string to standardize.
useIsomericSmiles (bool): Whether to retain stereochemistry information in the generated string.
removeCharges (bool): If true, add or remove hydrogens to produce uncharged molecules.
- Returns:
rdkit.Chem.Mol: Standardized salt-stripped RDKit Mol object.
- utils.struct_utils.base_smiles_from_inchi(inchi_str, useIsomericSmiles=True, removeCharges=False, workers=1)[source]¶
Generate standardized salt-stripped SMILES strings for the largest fragments of each molecule represented by InChI string(s) inchi_str. Replaces any rare isotopes with the most common ones for each element.
- Args:
inchi_str (list or str): List of InChI strings to convert.
useIsomericSmiles (bool): Whether to retain stereochemistry information in the generated strings.
removeCharges (bool): If true, add or remove hydrogens to produce uncharged molecules.
workers (int): Number of parallel threads to use for calculation.
- Returns:
list or str: Standardized SMILES strings.
- utils.struct_utils.base_smiles_from_smiles(orig_smiles, useIsomericSmiles=True, removeCharges=False, useCanonicalTautomers=False, workers=1)[source]¶
Generate standardized SMILES strings for the largest fragments of each molecule specified by orig_smiles. Strips salt groups and replaces any rare isotopes with the most common ones for each element.
- Args:
orig_smiles (list or str): List of SMILES strings to canonicalize.
useIsomericSmiles (bool): Whether to retain stereochemistry information in the generated strings.
removeCharges (bool): If true, add or remove hydrogens to produce uncharged molecules.
useCanonicalTautomers (bool): Whether to convert the generated SMILES to their canonical tautomers. Defaults to False for backward compatibility.
workers (int): Number of parallel threads to use for calculation.
- Returns:
list or str: Canonicalized SMILES strings.
- utils.struct_utils.canonical_tautomers_from_smiles(smiles)[source]¶
Returns SMILES strings for the canonical tautomers of a SMILES string or list of SMILES strings.
- Args:
smiles (list or str): List of SMILES strings.
- Returns:
list of str: List of SMILES strings for the canonical tautomers.
- utils.struct_utils.draw_structure(smiles_str, image_path, image_size=500)[source]¶
Draw structure for the compound with the given SMILES string as a PNG file.
Note that there are more flexible functions for drawing structures in the rdkit_easy module. This function is only retained for backward compatibility.
- Args:
smiles_str (str): SMILES representation of compound.
image_path (str): Filepath for image file to be generated.
image_size (int): Width of square bounding box for image.
- Returns:
None.
- utils.struct_utils.fix_moe_smiles(smiles)[source]¶
Correct the SMILES strings generated by MOE to standardize the representation of protonated atoms, so that RDKit can read them.
- Args:
smiles (str): SMILES string.
- Returns:
str: The corrected SMILES string.
- utils.struct_utils.get_rdkit_smiles(orig_smiles, useIsomericSmiles=True)[source]¶
Given a SMILES string, regenerate a “canonical” SMILES string for the same molecule using the implementation in RDKit.
- Args:
orig_smiles (str): SMILES string to canonicalize.
useIsomericSmiles (bool): Whether to retain stereochemistry information in the generated string.
- Returns:
str: Canonicalized SMILES string.
- utils.struct_utils.kekulize_smiles(orig_smiles, useIsomericSmiles=True, workers=1)[source]¶
Generate Kekulized SMILES strings for the molecules specified by orig_smiles. Kekulized SMILES strings are ones in which aromatic rings are represented by uppercase letters with alternating single and double bonds, rather than lowercase letters; they are needed by some external applications.
- Args:
orig_smiles (list or str): List of SMILES strings to Kekulize.
useIsomericSmiles (bool): Whether to retain stereochemistry information in the generated strings.
workers (int): Number of parallel threads to use for calculation.
- Returns:
list or str: Kekulized SMILES strings.
- utils.struct_utils.mol_wt_from_smiles(smiles, workers=1)[source]¶
Calculate molecular weights for molecules represented by SMILES strings.
- Args:
smiles (list or str): List of SMILES strings.
workers (int): Number of parallel threads to use for calculations.
- Returns:
list or float: Molecular weights. NaN is returned for SMILES strings that could not be read by RDKit.
- utils.struct_utils.mols_from_smiles(orig_smiles, workers=1)[source]¶
Parallel function to create RDKit Mol objects for a list of SMILES strings. If orig_smiles is a list and workers is > 1, spawn ‘workers’ threads to convert input SMILES strings to Mol objects.
- Args:
orig_smiles (list or str): List of SMILES strings to convert to Mol objects.
workers (int): Number of parallel threads to use for calculation.
- Returns:
list of rdkit.Chem.Mol: RDKit objects representing molecules.
- utils.struct_utils.rdkit_smiles_from_smiles(orig_smiles, useIsomericSmiles=True, useCanonicalTautomers=False, workers=1)[source]¶
Parallel version of get_rdkit_smiles. If orig_smiles is a list and workers is > 1, spawn ‘workers’ threads to convert input SMILES strings to standardized RDKit format.
- Args:
orig_smiles (list or str): List of SMILES strings to canonicalize.
useIsomericSmiles (bool): Whether to retain stereochemistry information in the generated strings.
useCanonicalTautomers (bool): Whether to convert the generated SMILES to their canonical tautomers. Defaults to False for backward compatibility.
workers (int): Number of parallel threads to use for calculation.
- Returns:
list or str: Canonicalized SMILES strings.
- utils.struct_utils.smiles_to_inchi_key(smiles)[source]¶
Generates an InChI key from a SMILES string. Note that an InChI key is different from an InChI string; it can be used as a unique identifier, but doesn’t hold the information needed to reconstruct a molecule.
- Args:
smiles (str): SMILES string.
- Returns:
str: An InChI key. Returns None if RDKit cannot convert the SMILES string to an RDKit Mol object.