Cross‐Mapping of Protein‐Ligand Binding Data Between ChEMBL and PDBbind

The ChEMBL database is a valuable open data source that provides a comprehensive collection of binding data, functional data, and ADMET properties of bioactive compounds. The PDBbind database has a more focused scope: it collects binding data for the protein‐ligand complexes in the Protein Data Bank. Currently, the PDBbind collection of binding data is rather modest compared to the ChEMBL collection (∼13 000 versus ∼1.3 million). One may wonder whether the former is actually a subset of the latter. In this study, we mapped the molecular information and protein‐ligand binding data in PDBbind to the records in ChEMBL, and then analyzed the overlap between the binding data recorded in the two databases. Our results indicate that only ∼20 % of the binding data in PDBbind have counterparts in ChEMBL. Thus, the PDBbind collection of binding data is largely complementary to the ChEMBL collection. We also identify two reasons for the low overlap between the two databases: first, only a minor fraction of the protein‐ligand complexes in PDBbind is covered by ChEMBL; second, the bodies of literature screened by the two databases do not overlap substantially either. Our study thus demonstrates the value of focused databases alongside more comprehensive ones.
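As a rough illustration of the kind of overlap analysis described above, the following Python sketch computes the fraction of records in a focused collection whose (ligand, protein) pair also appears in a larger one. The record type, the identifier scheme, and the toy entries are all assumptions made for illustration; the abstract does not specify the actual matching procedure, and the placeholder data carry no real measurements.

```python
from typing import Iterable, NamedTuple


class BindingRecord(NamedTuple):
    """Hypothetical, simplified binding record (not a real database schema)."""
    ligand_key: str   # assumed normalized ligand identifier, e.g. an InChIKey
    target_id: str    # assumed protein identifier, e.g. a UniProt accession
    affinity: float   # assumed -log10 of Kd/Ki/IC50 in molar units


def overlap_fraction(focused: Iterable[BindingRecord],
                     comprehensive: Iterable[BindingRecord]) -> float:
    """Fraction of `focused` records whose (ligand, target) pair
    also occurs in `comprehensive`."""
    focused = list(focused)
    if not focused:
        return 0.0
    # Index the larger collection by the pair used for matching.
    known_pairs = {(r.ligand_key, r.target_id) for r in comprehensive}
    matched = sum((r.ligand_key, r.target_id) in known_pairs for r in focused)
    return matched / len(focused)


# Toy placeholder data: identifiers and values are invented for the example.
focused_set = [
    BindingRecord("LIG_A", "PROT_1", 7.2),
    BindingRecord("LIG_B", "PROT_1", 6.1),
    BindingRecord("LIG_C", "PROT_2", 8.0),
    BindingRecord("LIG_D", "PROT_3", 5.5),
    BindingRecord("LIG_E", "PROT_4", 9.1),
]
comprehensive_set = [
    BindingRecord("LIG_A", "PROT_1", 7.0),   # same pair as in focused_set
    BindingRecord("LIG_A", "PROT_9", 4.8),   # same ligand, different protein
    BindingRecord("LIG_Z", "PROT_2", 6.6),   # same protein, different ligand
]

fraction = overlap_fraction(focused_set, comprehensive_set)
print(f"Matched fraction of the focused set: {fraction:.2f}")
```

Matching on the ligand and the protein together, rather than on either one alone, is a design choice of this sketch: a shared ligand or a shared target does not by itself imply that the same binding measurement was recorded in both collections.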
