Structure–activity relationship-based chemical classification of highly imbalanced Tox21 datasets

The specificity of toxicant-target biomolecule interactions contributes to the highly imbalanced nature of many toxicity datasets, which causes poor performance in Structure–Activity Relationship (SAR)-based chemical classification. Undersampling and oversampling are representative techniques for handling such imbalance. However, removing inactive chemical compound instances from the majority class by undersampling can result in information loss, whereas increasing active toxicant instances in the minority class by interpolation tends to introduce artificial minority instances that often cross into the majority class space, giving rise to class overlap and a higher false prediction rate. In this study, to improve the prediction accuracy of imbalanced learning, we employed SMOTEENN, a combination of the Synthetic Minority Over-sampling Technique (SMOTE) and Edited Nearest Neighbor (ENN) algorithms, to oversample the minority class by creating synthetic samples and then clean the mislabeled instances. We chose the highly imbalanced Tox21 dataset, which consists of 12 in vitro bioassays for > 10,000 chemicals that are distributed unevenly between binary classes. With Random Forest (RF) as the base classifier and bagging as the ensemble strategy, we applied four hybrid learning methods: RF without imbalance handling (RF), RF with Random Undersampling (RUS), RF with SMOTE (SMO), and RF with SMOTEENN (SMN). The performance of the four learning methods was compared using nine evaluation metrics, among which the F1 score, Matthews correlation coefficient, and Brier score provided a more consistent assessment of the overall performance across the 12 datasets. Friedman's aligned ranks test and the subsequent Bergmann-Hommel post hoc test showed that SMN significantly outperformed the other three methods.
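The two-stage idea behind SMOTEENN described above can be illustrated with a minimal, standard-library Python sketch. It is not the implementation used in the study: the function names (`knn_indices`, `smote`, `enn`, `smoteenn`), the choice of k = 3, Euclidean distance, the majority-vote editing rule, and the convention that label 1 marks the active (minority) class are all assumptions made for illustration, and library implementations may use different defaults.

```python
import math
import random


def knn_indices(points, query, k, exclude=None):
    """Indices of the k points closest to `query` (Euclidean), skipping `exclude`."""
    order = sorted(
        (i for i in range(len(points)) if i != exclude),
        key=lambda i: math.dist(points[i], query),
    )
    return order[:k]


def smote(minority, n_new, k=3, rng=None):
    """Create n_new synthetic points by interpolating between minority neighbors.

    Assumes at least two minority points so that every point has a neighbor.
    """
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        j = rng.choice(knn_indices(minority, minority[i], k, exclude=i))
        gap = rng.random()  # position along the segment from point i to neighbor j
        synthetic.append(
            tuple(a + gap * (b - a) for a, b in zip(minority[i], minority[j]))
        )
    return synthetic


def enn(X, y, k=3):
    """Keep only samples whose label matches the majority vote of their k neighbors."""
    keep = []
    for i, (x, label) in enumerate(zip(X, y)):
        votes = [y[j] for j in knn_indices(X, x, k, exclude=i)]
        if votes.count(label) * 2 > len(votes):
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]


def smoteenn(X, y, k=3, seed=0):
    """Oversample label 1 up to the size of label 0, then clean both classes."""
    minority = [x for x, label in zip(X, y) if label == 1]
    n_majority = len(y) - len(minority)
    new = smote(minority, n_majority - len(minority), k=k, rng=random.Random(seed))
    return enn(list(X) + new, list(y) + [1] * len(new), k=k)


# Toy 2-D data (hypothetical descriptors): 40 inactive and 5 active compounds.
demo_rng = random.Random(1)
inactive = [(demo_rng.gauss(0, 1), demo_rng.gauss(0, 1)) for _ in range(40)]
active = [(demo_rng.gauss(3, 0.5), demo_rng.gauss(3, 0.5)) for _ in range(5)]
X_res, y_res = smoteenn(inactive + active, [0] * 40 + [1] * 5)
print("inactive:", y_res.count(0), "active:", y_res.count(1))
```

In this sketch the cleaning step is applied to both classes after oversampling, so synthetic actives that land among inactive neighbors are removed along with inactives that sit inside the active cluster.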
We also found a strong negative correlation between the prediction accuracy and the imbalance ratio (IR), defined as the number of inactive compounds divided by the number of active compounds. SMN became less effective when IR exceeded a certain threshold (e.g., > 28). The ability to separate the few active compounds from the vast number of inactive ones is of great importance in computational toxicology. This work demonstrates that the performance of SAR-based chemical toxicity classification on imbalanced data can be significantly improved through the use of data rebalancing.
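As a worked illustration of the imbalance ratio defined above and of the three metrics singled out in the abstract, the following standard-library Python sketch computes IR, F1 score, Matthews correlation coefficient, and Brier score on a small made-up example. The function names, the toy labels, and the convention that 1 denotes an active compound are assumptions for illustration only; none of the numbers come from the Tox21 data.

```python
import math


def imbalance_ratio(labels):
    """IR = number of inactive compounds (0) / number of active compounds (1)."""
    n_active = sum(1 for v in labels if v == 1)
    return (len(labels) - n_active) / n_active


def confusion_counts(y_true, y_pred):
    """Return (TP, FP, FN, TN) with label 1 treated as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall: 2TP / (2TP + FP + FN)."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def matthews_corrcoef(y_true, y_pred):
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and true label."""
    return sum((p - t) ** 2 for t, p in zip(y_true, y_prob)) / len(y_true)


# Toy example: 2 active and 8 inactive compounds, so IR = 8 / 2.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_prob = [0.8, 0.4, 0.6, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2]
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]  # assumed 0.5 decision threshold

print("IR   :", imbalance_ratio(y_true))
print("F1   :", f1_score(y_true, y_pred))
print("MCC  :", matthews_corrcoef(y_true, y_pred))
print("Brier:", round(brier_score(y_true, y_prob), 3))
```

Plain accuracy on this toy example would be 0.8 even though only one of the two actives is recovered, which is why metrics that account for the minority class are more informative when IR is large.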
