Applying Mondrian Cross-Conformal Prediction To Estimate Prediction Confidence on Large Imbalanced Bioactivity Data Sets.

Conformal prediction has been proposed as a more rigorous way to define prediction confidence than the applicability domain concepts previously used in QSAR modeling. One main advantage of the method is that it provides a prediction region, potentially containing multiple predicted labels, in contrast to the single-valued (regression) or single-label (classification) output of standard QSAR modeling algorithms. Standard conformal prediction might not be suitable for imbalanced data sets. Therefore, Mondrian cross-conformal prediction (MCCP), which combines Mondrian inductive conformal prediction with cross-fold calibration sets, has been introduced. In this study, the MCCP method was applied to 18 publicly available data sets with imbalance levels ranging from 1:10 to 1:1000 (ratio of active to inactive compounds). Our results show that MCCP in general performed well on bioactivity data sets across these imbalance levels. More importantly, unlike standard machine learning methods, the method provides prediction confidence and prediction regions, and it also produces valid predictions for the minority class. In addition, a nonconformity measure based on compound similarity was investigated. Our results demonstrate that although it gives valid predictions, its efficiency is much worse than that of model-dependent nonconformity measures.
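
To make the combination of Mondrian (class-conditional) calibration and cross-fold calibration sets concrete, the following is a minimal sketch in Python, not the implementation used in the study. It rests on several assumptions: the function names (`centroid_scores`, `mccp_pvalues`, `prediction_regions`) are hypothetical, the nonconformity score is a simple nearest-centroid distance chosen only to keep the example self-contained, and the per-fold calibration counts are pooled into one p-value per class, which is one common way to aggregate across folds.

```python
import numpy as np

def centroid_scores(X_train, y_train, X, labels):
    """Hypothetical nonconformity score (an assumption for this sketch):
    distance to the centroid of the hypothesized class minus distance to the
    nearest other class centroid. Larger values mean 'stranger'."""
    centroids = np.stack([X_train[y_train == c].mean(axis=0) for c in labels])
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    scores = np.empty_like(d)
    for j in range(len(labels)):
        scores[:, j] = d[:, j] - np.delete(d, j, axis=1).min(axis=1)
    return scores  # shape (n_samples, n_classes)

def mccp_pvalues(X, y, X_test, n_folds=5, seed=0):
    """Class-conditional p-values with cross-fold calibration sets.
    Assumes every class has at least two examples."""
    labels = np.unique(y)
    rng = np.random.default_rng(seed)
    # Stratified fold assignment, so minority examples are spread over folds.
    fold = np.empty(len(y), dtype=int)
    for c in labels:
        idx = rng.permutation(np.flatnonzero(y == c))
        fold[idx] = np.arange(len(idx)) % n_folds
    counts = np.zeros((len(X_test), len(labels)))
    totals = np.zeros(len(labels))
    for k in range(n_folds):
        cal = fold == k   # calibration set for this fold
        train = ~cal      # proper training set for this fold
        cal_scores = centroid_scores(X[train], y[train], X[cal], labels)
        test_scores = centroid_scores(X[train], y[train], X_test, labels)
        for j, c in enumerate(labels):
            # Mondrian step: compare only against calibration examples
            # whose true label is the hypothesized class c.
            cal_c = cal_scores[y[cal] == c, j]
            counts[:, j] += (cal_c[None, :] >= test_scores[:, j, None]).sum(axis=1)
            totals[j] += len(cal_c)
    # Pool the counts over folds (an assumed aggregation choice).
    return labels, (counts + 1) / (totals + 1)

def prediction_regions(labels, pvals, epsilon):
    """Keep every label whose p-value exceeds the significance level."""
    return [set(labels[row > epsilon].tolist()) for row in pvals]

# Synthetic 1:10 imbalanced data: label 1 = active, label 0 = inactive.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 1.0, (200, 2)), rng.normal(4.0, 1.0, (20, 2))])
y = np.array([0] * 200 + [1] * 20)
X_new = np.array([[0.0, 0.0], [4.0, 4.0], [2.0, 2.0]])

labels, pvals = mccp_pvalues(X, y, X_new)
for x, region in zip(X_new, prediction_regions(labels, pvals, epsilon=0.2)):
    print(x, region)
```

Because each class is calibrated only against its own examples, the error rate is controlled separately for actives and inactives, which is the property that matters for the minority class. A prediction region may contain one label, both labels, or none; fewer multi-label regions at a given significance level correspond to higher efficiency.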
