Adding stochastic negative examples into machine learning improves molecular bioactivity prediction

Multitask deep neural networks learn to predict ligand-target binding by example, yet public pharmacological datasets are sparse, imbalanced, and approximate. We constructed two hold-out benchmarks to approximate temporal and drug-screening test scenarios whose characteristics differ from a random split of conventional training datasets. We developed a pharmacological dataset augmentation procedure, Stochastic Negative Addition (SNA), that randomly assigns untested molecule-target pairs as transient negative examples during training. Under the SNA procedure, performance on the drug-screening benchmark increased from R² = 0.1926 ± 0.0186 to 0.4269 ± 0.0272 (a 121.7% relative increase). This gain was accompanied by a modest decrease in temporal benchmark performance (13.42%). The SNA gains in drug-screening performance were consistent across classification and regression tasks and outperformed scrambled controls. Our results highlight where data and feature uncertainty may be problematic, but also show how incorporating uncertainty into training improves predictions of drug-target relationships.
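To make the idea of transient negatives concrete, the following is a minimal sketch, not the authors' implementation. It assumes the training labels are stored as a dense molecule-by-target NumPy array in which untested pairs are `NaN`. The function name `stochastic_negative_addition`, the `sna_ratio` parameter (number of added negatives per known label), and the placeholder `negative_value` are all hypothetical; the abstract does not specify any of these choices.

```python
import numpy as np


def stochastic_negative_addition(labels, sna_ratio=1.0, negative_value=0.0, rng=None):
    """Return a copy of `labels` in which a random subset of untested (NaN)
    molecule-target pairs is filled with `negative_value`.

    Assumptions (not taken from the source): labels is a 2-D float array,
    NaN marks an untested pair, and `sna_ratio` is the number of sampled
    negatives per known label.
    """
    rng = np.random.default_rng() if rng is None else rng
    augmented = labels.copy()

    # Flat indices of every untested molecule-target pair.
    untested = np.flatnonzero(np.isnan(labels))
    n_known = labels.size - untested.size

    # Never request more negatives than there are untested pairs.
    n_add = min(untested.size, int(round(sna_ratio * n_known)))
    chosen = rng.choice(untested, size=n_add, replace=False)

    # Write the assumed negative label into the copy only; the original
    # matrix is untouched, so the negatives do not persist between calls.
    np.put(augmented, chosen, negative_value)
    return augmented, chosen


# Toy label matrix: 4 molecules x 3 targets, 3 measured values.
labels = np.array([
    [7.2, np.nan, np.nan],
    [np.nan, 5.1, np.nan],
    [np.nan, np.nan, 6.3],
    [np.nan, np.nan, np.nan],
])

rng = np.random.default_rng(0)
for epoch in range(2):
    # Resampling every epoch is what makes the negatives transient.
    augmented, chosen = stochastic_negative_addition(labels, rng=rng)
    print(epoch, np.sort(chosen))
```

Because the sampled pairs are written into a copy, calling the function once per epoch draws a fresh set of assumed negatives each time, so any individual untested pair is treated as negative only intermittently during training.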
