Accelerating high-throughput virtual screening through molecular pool-based active learning

Structure-based virtual screening is an important tool in early-stage drug discovery that scores the interactions between a target protein and candidate ligands. As virtual libraries continue to grow (now exceeding $10^8$ molecules), so do the resources needed to screen them exhaustively. Bayesian optimization techniques can make this exploration more efficient: a surrogate structure-property relationship model trained on the predicted affinities of a subset of the library can be applied to the remaining library members, allowing the least promising compounds to be excluded from evaluation. In this study, we assess various surrogate model architectures, acquisition functions, and acquisition batch sizes on several protein-ligand docking datasets and observe significant reductions in computational cost, even with a greedy acquisition strategy; for example, 87.9% of the top-50,000 ligands can be found after testing only 2.4% of a 100M-member library. Such model-guided searches mitigate the rising computational cost of screening ever-larger virtual libraries and can accelerate high-throughput virtual screening campaigns, with applications beyond docking.
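
The pool-based loop described above can be illustrated with a minimal sketch. It is not the study's implementation: the surrogate here is a closed-form ridge regression rather than any of the architectures assessed, the "library" is a synthetic feature matrix, and the names `fit_ridge`, `greedy_pool_search`, and `top_k_recall` are hypothetical. Lower scores are treated as better, as is conventional for docking scores. The sketch also assumes the pool is much larger than the total number of evaluations.

```python
import numpy as np


def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge regression; a stand-in for the surrogate model (assumption)."""
    A = X.T @ X + alpha * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ y)


def greedy_pool_search(X, objective, init_size, batch_size, n_iters, seed=0):
    """Greedy pool-based search over a fixed library.

    X         : (n_molecules, n_features) feature matrix for the whole pool
    objective : callable mapping a pool index to its (expensive) score
    Returns a dict {pool index: evaluated score}.
    """
    rng = np.random.default_rng(seed)
    evaluated = {}

    # Random initial batch to seed the surrogate.
    for i in rng.choice(X.shape[0], size=init_size, replace=False):
        evaluated[int(i)] = objective(int(i))

    for _ in range(n_iters):
        idx = list(evaluated)
        y = np.array([evaluated[i] for i in idx])

        # Retrain the surrogate on everything evaluated so far.
        w = fit_ridge(X[idx], y)

        # Predict the full pool and mask out molecules already evaluated.
        pred = X @ w
        pred[idx] = np.inf

        # Greedy acquisition: take the batch with the best (lowest) predictions.
        for i in np.argsort(pred)[:batch_size]:
            evaluated[int(i)] = objective(int(i))

    return evaluated


def top_k_recall(true_scores, evaluated, k):
    """Fraction of the true top-k (lowest-scoring) pool members that were evaluated."""
    top_k = set(np.argsort(true_scores)[:k].tolist())
    return len(top_k & set(evaluated)) / k


# Synthetic demonstration: a noisy linear "score" over random features (assumption).
rng = np.random.default_rng(42)
X = rng.normal(size=(5000, 16))
true_scores = X @ rng.normal(size=16) + 0.1 * rng.normal(size=5000)

evaluated = greedy_pool_search(
    X, lambda i: float(true_scores[i]), init_size=50, batch_size=50, n_iters=5
)
recall = top_k_recall(true_scores, evaluated, k=50)
print(f"evaluated {len(evaluated)} of {len(X)}; top-50 recall = {recall:.2f}")
```

The recall metric mirrors the one quoted in the abstract: at that scale, testing 2.4% of a 100M-member library means roughly 2.4 million docking evaluations, and recovering 87.9% of the top 50,000 means finding about 43,950 of them.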
