Leak Proof PDBBind: A Reorganized Dataset of Protein-Ligand Complexes for More Generalizable Binding Affinity Prediction

Many physics-based and machine-learned scoring functions (SFs) used to predict protein-ligand binding free energies have been trained on the PDBBind dataset. However, whether new SFs are actually improving remains controversial, because the general, refined, and core datasets of PDBBind are cross-contaminated with highly similar proteins and ligands, so SFs trained and evaluated on them may not perform comparably well when predicting binding for new protein-ligand complexes. In this work we have carefully prepared a cleaned PDBBind dataset of non-covalent binders that is split into training, validation, and test datasets to control for data leakage. The resulting leak-proof (LP)-PDBBind data is used to retrain four popular SFs, AutoDock Vina, Random Forest (RF)-Score, InteractionGraphNet (IGN), and DeepDTA, to better test their capabilities when applied to new protein-ligand complexes. In particular, we have formulated a new independent dataset, BDB2020+, by matching high-quality binding free energies from BindingDB with co-crystallized ligand-protein complexes from the PDB that have been deposited since 2020. Based on all the benchmark results, the retrained models using LP-PDBBind that rely on 3D information perform consistently among the best, with IGN especially being recommended for scoring and ranking applications for new protein-ligand systems.
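The idea of splitting data so that highly similar entries never straddle the training, validation, and test sets can be illustrated with a minimal sketch. It is not the procedure used to build LP-PDBBind: it assumes each complex is described by a single hypothetical feature set (for example, fingerprint bits), measures similarity with the Dice coefficient, groups entries by single-linkage clustering, and assigns whole clusters to one split. The function names, the threshold, and the split fractions are all illustrative assumptions.

```python
from itertools import combinations


def dice(a, b):
    """Dice coefficient between two feature sets (1.0 if both are empty)."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


def leak_free_split(features, threshold=0.5, fractions=(0.8, 0.1, 0.1)):
    """Split ids into (train, val, test) so that no pair with
    similarity >= threshold ends up in two different splits.

    `features` maps an id to a set of hashable features (hypothetical input).
    """
    ids = list(features)
    parent = {i: i for i in ids}

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Single-linkage clustering: merge every pair above the threshold.
    for a, b in combinations(ids, 2):
        if dice(features[a], features[b]) >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for i in ids:
        clusters.setdefault(find(i), []).append(i)

    # Assign whole clusters, largest first, to the split that is
    # currently furthest below its target size.
    targets = [f * len(ids) for f in fractions]
    splits = ([], [], [])
    for members in sorted(clusters.values(), key=len, reverse=True):
        k = max(range(3), key=lambda j: targets[j] - len(splits[j]))
        splits[k].extend(members)
    return splits
```

Because clusters are kept intact, the resulting split sizes only approximate the requested fractions. A realistic pipeline would treat protein and ligand similarity separately, which this sketch does not attempt.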
