Validating the validation: reanalyzing a large-scale comparison of deep learning and machine learning models for bioactivity prediction

Machine learning methods have the potential to significantly accelerate drug discovery. However, the rate at which new methodological approaches are published raises a fundamental question: how should models be benchmarked and validated? We reanalyze the data generated by a recently published large-scale comparison of machine learning models for bioactivity prediction and arrive at a somewhat different conclusion. We show that the performance of support vector machines is competitive with that of deep learning methods. Additionally, using a series of numerical experiments, we question the relevance of the area under the receiver operating characteristic (ROC) curve as a metric in virtual screening and suggest that the area under the precision–recall curve be used in conjunction with it. Our numerical experiments also highlight challenges in estimating the uncertainty in model performance via scaffold-split nested cross-validation.
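
To make the point about metrics concrete, here is a minimal, illustrative sketch (not code from the paper) using scikit-learn: on a synthetic, heavily imbalanced classification task of the kind encountered in virtual screening, the ROC AUC of a support vector machine can look near-perfect while the area under the precision–recall curve, estimated here by average precision, remains far lower. The dataset, model, and all parameters below are illustrative assumptions.

# Illustrative sketch, not the paper's code: ROC AUC vs. PR AUC on an
# imbalanced synthetic dataset (all names and parameters are assumptions).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score, average_precision_score

# Simulate a screening-like task: ~2% "actives" among "inactives".
X, y = make_classification(n_samples=5000, n_features=50, n_informative=10,
                           weights=[0.98, 0.02], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, test_size=0.5,
                                          random_state=0)

# A support vector machine scored with its decision function.
clf = SVC(kernel="rbf", gamma="scale").fit(X_tr, y_tr)
scores = clf.decision_function(X_te)

print(f"ROC AUC:                 {roc_auc_score(y_te, scores):.3f}")
# Average precision summarizes the precision-recall curve.
print(f"PR AUC (avg. precision): {average_precision_score(y_te, scores):.3f}")
# Under heavy class imbalance the ROC AUC is typically much closer to 1
# than the PR AUC, because the false-positive rate is diluted by the
# large number of inactives.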

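The abstract's final point concerns uncertainty estimates from scaffold-split nested cross-validation. The sketch below, again illustrative rather than taken from the paper, uses scikit-learn's GroupKFold with random group labels standing in for Bemis-Murcko scaffolds; it shows only the outer-loop, fold-to-fold spread of the ROC AUC, which is itself a noisy estimate when the number of folds is small.

# Illustrative sketch: fold-to-fold spread of ROC AUC under grouped CV,
# with random groups standing in for molecular scaffolds (an assumption;
# a real scaffold split would group by, e.g., Bemis-Murcko scaffolds).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GroupKFold, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=2000, n_features=50,
                           weights=[0.9, 0.1], random_state=1)
groups = np.random.default_rng(1).integers(0, 40, size=len(y))

scores = cross_val_score(SVC(kernel="rbf", gamma="scale"), X, y,
                         groups=groups, scoring="roc_auc",
                         cv=GroupKFold(n_splits=5))
print("per-fold ROC AUC:", np.round(scores, 3))
print(f"mean +/- std: {scores.mean():.3f} +/- {scores.std():.3f}")
# With only a handful of folds, the standard deviation across folds is
# itself poorly estimated; nested CV adds an inner model-selection loop
# per fold, which compounds the difficulty of quantifying uncertainty.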