Binary Classification of a Large Collection of Environmental Chemicals from Estrogen Receptor Assays by Quantitative Structure-Activity Relationship and Machine Learning Methods

Thousands of environmental chemicals are subject to regulatory decisions regarding their endocrine-disrupting potential. The ToxCast and Tox21 programs have tested ∼8200 chemicals in a broad panel of in vitro high-throughput screening (HTS) assays for estrogen receptor (ER) agonist and antagonist activity. The present work uses this large data set to develop in silico quantitative structure-activity relationship (QSAR) models using machine learning (ML) methods and a novel approach to managing the imbalanced data distribution. Training compounds from the ToxCast project were categorized as active or inactive (binding or nonbinding) based on a composite ER Interaction Score derived from a collection of 13 ER in vitro assays. A total of 1537 chemicals from ToxCast were used to derive and optimize the binary classification models, while 5073 additional chemicals from the Tox21 project, evaluated in 2 of the 13 in vitro assays, were used to externally validate model performance. To handle the imbalanced distribution of active and inactive chemicals, we developed a cluster-selection strategy intended to minimize information loss and increase predictive performance, and we compared this strategy with three currently popular techniques: cost-sensitive learning, oversampling of the minority class, and undersampling of the majority class. QSAR classification models were built to relate the molecular structures of chemicals to their ER activities using linear discriminant analysis (LDA), classification and regression trees (CART), and support vector machines (SVM), with 51 molecular descriptors from QikProp and 4328 bits of structural fingerprints as explanatory variables. A random forest (RF) feature selection method was employed to extract the structural features most relevant to ER activity.
The best model was obtained using SVM in combination with a subset of descriptors selected from the full set by the RF algorithm. On the internal test set, this model classified active and inactive compounds with accuracies of 76.1% and 82.8%, respectively, for a total accuracy of 81.6%; on the external test set, the total accuracy was 70.8%. These results demonstrate that combining high-quality experimental data with ML methods can lead to robust models that achieve excellent predictive accuracy and that are potentially useful for facilitating the virtual screening of chemicals for environmental risk assessment.
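The three accuracies reported for the internal test set are linked by a simple identity: total accuracy is the class-weighted average of the per-class accuracies. The following minimal Python sketch illustrates that relationship. The function names `class_accuracies` and `implied_active_fraction` are hypothetical and are not taken from the paper, and the labels in the usage example are toy data. The active fraction computed from the reported percentages is only an inference from rounded values, not a figure stated in the abstract.

```python
def class_accuracies(y_true, y_pred):
    """Return (active_acc, inactive_acc, total_acc) for binary labels.

    Labels are assumed to be 1 (active) or 0 (inactive).
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    n_active = sum(1 for t in y_true if t == 1)
    n_inactive = len(y_true) - n_active
    if n_active == 0 or n_inactive == 0:
        raise ValueError("both classes must be present")
    # Correctly predicted actives (true positives) and inactives (true negatives)
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp / n_active, tn / n_inactive, (tp + tn) / len(y_true)


def implied_active_fraction(acc_active, acc_inactive, acc_total):
    """Solve acc_total = f * acc_active + (1 - f) * acc_inactive for f."""
    if acc_active == acc_inactive:
        raise ValueError("fraction is undetermined when class accuracies are equal")
    return (acc_inactive - acc_total) / (acc_inactive - acc_active)


# Toy example: 4 actives and 6 inactives (illustrative data only)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
active_acc, inactive_acc, total_acc = class_accuracies(y_true, y_pred)

# Fraction of actives implied by the rounded percentages in the abstract
reported_fraction = implied_active_fraction(0.761, 0.828, 0.816)
```

Under this identity, the reported figures would correspond to an internal test set in which roughly 18% of compounds are active, which is consistent with the class imbalance the abstract describes. Because the inputs are rounded to one decimal place, this estimate should be read as approximate.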
