Molecule Property Prediction Datasets

Molecule property prediction is the task of predicting the properties of molecules from their structures. These predictions are essential in drug design, substance discovery, and chemical research. Researchers use molecular representations and machine learning techniques to analyze and forecast properties such as toxicity, biological activity, and physical characteristics. By combining advanced algorithms with molecular data, scientists aim to improve chemical design, reduce research costs, accelerate drug discovery, and predict molecular behavior more accurately.

Biogen

Paper: Prospective Validation of Machine Learning Algorithms for Absorption, Distribution, Metabolism, and Excretion Prediction: An Industrial Perspective
Dataset: Computational-ADME

Dataset consisting of in-vitro ADME experiments for a set of nonproprietary small-molecule compounds which were selected from commercial vendor libraries.

| Dataset Measurement | No. of compounds | Relevance |
| --- | --- | --- |
| Human Liver Microsomal (HLM) stability reported as intrinsic clearance (Clint, mL/min/kg) | 3,087 | The intrinsic clearance measured by HLM stability is crucial for understanding how a drug candidate is metabolized and cleared in the human body. This information is essential for predicting the pharmacokinetic profile of the drug and its potential efficacy and safety. |
| Rat Liver Microsomal (RLM) stability reported as intrinsic clearance (Clint, mL/min/kg) | 3,054 | Similar to HLM stability, RLM stability provides important information about the metabolism and clearance of a compound in rats. This data is valuable for predicting the pharmacokinetic behavior of the compound in preclinical studies. |
| MDR1-MDCK Efflux Ratio (MDR1-MDCK ER) | 2,642 | This endpoint provides insights into the compound’s potential to be transported out of cells by efflux pumps. Understanding the efflux ratio is important for predicting the compound’s bioavailability and potential for drug-drug interactions. |
| Aqueous Solubility at pH 6.8 (µg/mL) | 2,173 | The solubility of a compound is a critical factor in drug absorption. Poor solubility can lead to challenges in formulating the drug and may impact its bioavailability, which is essential for ensuring the drug reaches its target site in the body. |
| Human Plasma Protein Binding (hPPB), percent unbound | 194 | Understanding the percentage of unbound compound in human plasma is essential for predicting the distribution and potential interactions of the drug with plasma proteins. This information is important for assessing the compound’s pharmacokinetic properties. |
| Rat Plasma Protein Binding (rPPB), percent unbound | 168 | Similar to hPPB, rPPB provides insights into the distribution and interactions of the compound in rats. Understanding the percentage of unbound compound in rat plasma is important for predicting the compound’s pharmacokinetic behavior in preclinical studies. |

Flaws in the MoleculeNet Datasets

Blog: We Need Better Benchmarks for Machine Learning in Drug Discovery
Dataset: MoleculeNet

Blood Brain Barrier permeability (BBBP) Dataset

  • Invalid structures: 11 SMILES contain uncharged tetravalent nitrogen atoms, although a tetravalent nitrogen should always carry a charge. As a result of these errors, popular cheminformatics toolkits like the RDKit cannot parse the structures.
  • Inconsistent chemical representation: The carboxylic acid moiety (a functional group) is represented inconsistently across the 59 beta-lactam antibiotics in the BBBP dataset. Three different forms are used: the protonated acid, the anionic carboxylate, and the anionic salt form.
  • Data curation errors: The dataset contains 59 duplicate structures. Ten of these duplicates have conflicting labels, with the same molecule labeled as both BBB penetrant and BBB non-penetrant. The dataset also contains incorrect labels.
  • Incorrect endpoints: Many authors have simply classified any drugs used for psychiatric indications or drugs with side effects such as drowsiness as CNS penetrant. Experimental measurements of BBB penetration are widely variable.
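
The duplicate check described above can be sketched in a few lines. This is a minimal illustration, not the procedure used in the cited blog: it assumes each structure has already been reduced to a canonical key (for example a canonical SMILES or an InChIKey, which a cheminformatics toolkit would normally produce), and the function name `find_conflicting_duplicates` and the toy records are hypothetical.

```python
from collections import defaultdict


def find_conflicting_duplicates(records):
    """Group (structure_key, label) pairs and report duplicates.

    The structure key is assumed to be already canonicalized; producing
    it is outside the scope of this sketch.
    """
    labels_by_key = defaultdict(list)
    for key, label in records:
        labels_by_key[key].append(label)

    # Structures that appear more than once.
    duplicates = {k: v for k, v in labels_by_key.items() if len(v) > 1}
    # Duplicates whose copies disagree on the label.
    conflicts = {k: v for k, v in duplicates.items() if len(set(v)) > 1}
    return duplicates, conflicts


# Toy records (hypothetical, not taken from BBBP): 1 = penetrant, 0 = non-penetrant.
toy_records = [
    ("CCO", 1),
    ("CCO", 1),        # consistent duplicate
    ("c1ccccc1", 1),
    ("c1ccccc1", 0),   # conflicting duplicate
    ("CC(=O)O", 0),
]

duplicates, conflicts = find_conflicting_duplicates(toy_records)
print(sorted(duplicates))
print(sorted(conflicts))
```

Consistent duplicates can usually be collapsed to a single row, while conflicting ones need to be resolved or dropped before the data is used as a benchmark.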

BACE Dataset

  • Stereochemistry issues: Stereoisomers can have vastly different properties and biological activities. 71% of the molecules have at least one undefined stereocenter, 222 molecules have 3 undefined stereocenters, and one molecule has 12 undefined stereocenters. Ideally, benchmark datasets should consist of achiral or chirally pure molecules with clearly defined stereocenters.
  • Inconsistent measurements: The BACE dataset was collected from 55 papers. It is highly unlikely that the authors of these 55 papers employed the same experimental procedures to determine IC50s. It is important to clearly understand the relationship between model accuracy and experimental error.
  • Cutoffs: Molecules with IC50 values less than 200nM are considered active, and molecules with IC50 values greater than or equal to 200nM are considered inactive.  200nM is an odd choice for an activity cutoff.  It is quite a bit more potent than the values one finds with screening hits, which typically have IC50s in the single to double-digit µM range.  The cutoff is also 10-20x greater than an IC50 one would target in lead optimization.
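
To make the cutoff concrete, the labeling rule can be written out as below. This is a minimal sketch: the 200 nM threshold comes from the text above, while the helper names `is_active` and `pic50` and the example IC50 values are hypothetical and chosen only to bracket the cutoff.

```python
import math

ACTIVITY_CUTOFF_NM = 200.0  # cutoff described in the text


def is_active(ic50_nm, cutoff_nm=ACTIVITY_CUTOFF_NM):
    """Active if IC50 is strictly below the cutoff (lower IC50 = more potent)."""
    return ic50_nm < cutoff_nm


def pic50(ic50_nm):
    """Convert an IC50 in nM to pIC50 = -log10(IC50 in mol/L)."""
    return -math.log10(ic50_nm * 1e-9)


# Hypothetical IC50 values in nM, for illustration only.
for ic50_nm in (15.0, 199.0, 200.0, 5000.0):
    print(ic50_nm, is_active(ic50_nm), round(pic50(ic50_nm), 2))
```

On the pIC50 scale the 200 nM cutoff sits at about 6.7, between typical screening hits in the µM range (pIC50 of roughly 5 to 6) and lead-optimization targets of 10 to 20 nM (pIC50 of roughly 7.7 to 8).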

ESOL Aqueous Solubility Dataset

  • Realistic dynamic range: The dynamic range of the data used in benchmarks should closely approximate the ranges one encounters in practice. Most pharmaceutical compounds tend to have solubilities somewhere between 1 and 500 µM. Assays are typically run within this relatively narrow range spanning 2.5 to 3 logs. This narrow dynamic range makes achieving good correlations between experimental and predicted values difficult. The ESOL dataset in MoleculeNet spans more than 13 logs, and it is easy to get a good correlation with very simple models. Unfortunately, this performance doesn’t reflect what one sees when predicting realistic test sets.
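
The effect of dynamic range on correlation can be illustrated with a small simulation. This is a sketch under stated assumptions, not an analysis of ESOL itself: true log-solubilities are drawn uniformly, "predictions" are the true values plus Gaussian noise, and the noise level of 0.6 log units is an arbitrary illustrative choice.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def simulate_r(log_low, log_high, noise_sd, n, rng):
    """Draw true log values uniformly and add Gaussian error to mimic predictions."""
    true_vals = [rng.uniform(log_low, log_high) for _ in range(n)]
    predicted = [t + rng.gauss(0.0, noise_sd) for t in true_vals]
    return pearson_r(true_vals, predicted)


rng = random.Random(0)
NOISE_SD = 0.6  # assumed prediction error in log units, illustrative only

# 1 to 500 µM spans log10(500), about 2.7 logs; 13 logs mimics the ESOL range.
narrow_span = math.log10(500) - math.log10(1)
r_narrow = simulate_r(0.0, narrow_span, NOISE_SD, 2000, rng)
r_wide = simulate_r(0.0, 13.0, NOISE_SD, 2000, rng)
print(f"narrow range r = {r_narrow:.2f}, wide range r = {r_wide:.2f}")
```

With the same prediction error in both cases, the wide range yields a much higher correlation simply because the spread of the true values dwarfs the noise, which is why a good correlation on a 13-log dataset says little about performance on realistic compounds.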

FreeSolv Dataset

  • Irrelevant if used in isolation: This dataset was designed to evaluate the ability of molecular dynamics simulations to estimate the free energy of solvation, which is an essential component of free energy calculations. However, this quantity, in and of itself, is not particularly useful if used in isolation.

HIV Dataset

  • Assay with a high number of artifacts: The dataset consists of binary labels derived from 40,000 compounds tested in a cell assay designed to identify molecules that can inhibit HIV replication. 70% of molecules labeled as “confirmed active” (CA) trigger one or more structural alerts. Of the 404 molecules labeled as CA, 68 are azo dyes, which are widely known to be cytotoxic and to cause assay interference. Structural alerts provide a means of identifying, and possibly eliminating, some of these problematic molecules.

Toxicity Datasets

SIDER

  • Contains categories with little scientific meaning, such as “Product issues”, “Investigations”, and “Social circumstances”.
  • Lack of mechanistic information associated with the side effects.

Toxcast

  • Dataset is not a complete matrix; assay columns typically have ~2,500 binary labels and ~6,000 missing values.
  • The dataset consists of many “bad actors”—56% of the molecules in the set trigger one or more structural alerts.
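
With roughly 6,000 of about 8,500 entries missing per assay column (around 70%), any metric has to be computed over the measured labels only. The following is a minimal sketch of that idea; it assumes missing labels are stored as `None`, and the helper names and toy data are hypothetical.

```python
def missing_fraction(column):
    """Fraction of entries in one assay column that are missing (None)."""
    return sum(v is None for v in column) / len(column)


def masked_accuracy(y_true, y_pred):
    """Accuracy computed only over positions where a label was measured."""
    pairs = [(t, p) for t, p in zip(y_true, y_pred) if t is not None]
    if not pairs:
        return None  # no measured labels for this assay
    return sum(t == p for t, p in pairs) / len(pairs)


# Toy assay column (hypothetical): None marks a molecule that was not tested.
y_true = [1, None, 0, None, None, 1, 0, None]
y_pred = [1, 0, 0, 1, 0, 0, 0, 1]

print(missing_fraction(y_true))
print(masked_accuracy(y_true, y_pred))
```

Treating missing entries as inactive instead of masking them would silently inflate the number of negatives, so the masking step matters for both training losses and evaluation.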

Tox21

  • The data here is less sparse than the Toxcast dataset, with between 575 and 2,104 missing values per assay.

Clintox

  • Clinical toxicity observations lack specific information and can arise from a multitude of mechanisms.
This post is licensed under CC BY 4.0 by the author.