DOCKSTRING: Easy Molecular Docking Yields Better Benchmarks for Ligand Design

The field of machine learning for drug discovery is witnessing an explosion of novel methods. These methods are often benchmarked on simple physicochemical properties such as solubility or general druglikeness, which can be readily computed. However, these properties are poor representatives of objective functions in drug design, mainly because they do not depend on the candidate compound’s interaction with the target. By contrast, molecular docking is a widely applied method in drug discovery to estimate binding affinities. Setting up docking studies correctly, though, requires a significant amount of domain knowledge, which hampers adoption. Here, we present dockstring, a bundle for meaningful and robust comparison of ML models using docking scores. dockstring consists of three components: (1) an open-source Python package for straightforward computation of docking scores, (2) an extensive dataset of docking scores and poses of more than 260,000 molecules for 58 medically relevant targets, and (3) a set of pharmaceutically relevant benchmark tasks such as virtual screening or de novo design of selective kinase inhibitors. The Python package implements a robust ligand and target preparation protocol that allows nonexperts to obtain meaningful docking scores. Our dataset is the first to include docking poses, as well as the first of its size that is a full matrix, thus facilitating experiments in multiobjective optimization and transfer learning. Overall, our results indicate that docking scores are a more realistic evaluation objective than simple physicochemical properties, yielding benchmark tasks that are more challenging and more closely related to real problems in drug discovery.
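To illustrate why a full score matrix helps with multiobjective tasks such as selective inhibitor design, the following is a minimal sketch in plain Python. It does not use the dockstring package or its dataset: the molecule names, the score values, and the helper functions `selectivity_margin` and `rank_by_selectivity` are all hypothetical, and the margin shown is one simple way to express selectivity, not necessarily the objective used in the benchmark tasks. It assumes the usual docking convention that more negative scores indicate stronger predicted binding.

```python
from typing import Dict, List

# Hypothetical docking scores (illustrative values only, not from the dataset).
# More negative = stronger predicted binding. Because the matrix is "full",
# every molecule has a score for every target, so no imputation is needed.
SCORES: Dict[str, Dict[str, float]] = {
    "mol_a": {"JAK2": -9.5, "LCK": -7.0, "DRD2": -6.5},
    "mol_b": {"JAK2": -9.0, "LCK": -9.2, "DRD2": -6.0},
    "mol_c": {"JAK2": -8.0, "LCK": -6.0, "DRD2": -8.5},
}


def selectivity_margin(row: Dict[str, float], target: str, off_targets: List[str]) -> float:
    """On-target score minus the strongest (most negative) off-target score.

    A more negative margin means the molecule is predicted to bind the
    target more strongly than any of the listed off-targets.
    """
    strongest_off_target = min(row[name] for name in off_targets)
    return row[target] - strongest_off_target


def rank_by_selectivity(
    scores: Dict[str, Dict[str, float]], target: str, off_targets: List[str]
) -> List[str]:
    """Return molecule names ordered from most to least selective."""
    return sorted(
        scores,
        key=lambda name: selectivity_margin(scores[name], target, off_targets),
    )


if __name__ == "__main__":
    for name in rank_by_selectivity(SCORES, "JAK2", ["LCK"]):
        margin = selectivity_margin(SCORES[name], "JAK2", ["LCK"])
        print(f"{name}: margin = {margin:+.1f}")
```

Because every molecule has a score for every target, the same matrix can be reused with a different target or off-target list without any additional docking runs, which is the practical benefit of the full-matrix layout for multiobjective experiments.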
