Complementarity between public and commercial databases: new opportunities in medicinal chemistry informatics.

The last two years have seen a dramatic expansion in public cheminformatics, as exemplified by the approximately five-fold growth of PubChem, fed by over 50 contributing data sources. Consequently, medicinal chemists who were hitherto limited to commercial databases now also have access to public sources that they can download and/or query directly over the Web. The range of public sources, particularly where they link out to structured bioinformatic and biological data, already offers utilities that have no commercial equivalent. This work reviews compound content comparisons between selected public and commercial databases that capture bioactive content, focusing particularly on those that specify relationships between compounds and their protein targets. Our stringent filtering produced lower unique compound numbers than those reported for the individual databases and thereby facilitated standardised comparisons of content. The resultant matrix shows the pairwise comparison of each database and of selected subsets. Overall, the matrix shows an unexpected degree of non-overlap, emphasising the complementarity gained from combining public and commercial sources. This conclusion is supported by a Venn-type analysis of GVKBIO, WOMBAT (both commercial) and PubChem (public). These databases show not only overlap but also unique bioactive content in each case, because of their different strategies for source selection and data collection.
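The pairwise matrix and the Venn-type analysis described above both reduce to set operations on standardised compound identifiers. The following is a minimal sketch of that idea, not the authors' actual pipeline: the function names `pairwise_overlap` and `venn3` and the toy identifiers `c1` to `c7` are hypothetical, and it assumes each database has already been filtered down to a set of unique structure keys.

```python
from itertools import combinations


def pairwise_overlap(databases):
    """Return {(name_a, name_b): shared_count} for every pair of compound sets."""
    return {
        (a, b): len(databases[a] & databases[b])
        for a, b in combinations(sorted(databases), 2)
    }


def venn3(a, b, c):
    """Return the sizes of the seven regions of a three-set Venn diagram."""
    return {
        "a_only": len(a - b - c),
        "b_only": len(b - a - c),
        "c_only": len(c - a - b),
        "a_b_only": len((a & b) - c),
        "a_c_only": len((a & c) - b),
        "b_c_only": len((b & c) - a),
        "all_three": len(a & b & c),
    }


# Toy identifiers standing in for standardised structure keys (hypothetical data,
# not real database content).
databases = {
    "GVKBIO": {"c1", "c2", "c3", "c4"},
    "WOMBAT": {"c3", "c4", "c5"},
    "PubChem": {"c1", "c4", "c6", "c7"},
}

matrix = pairwise_overlap(databases)
regions = venn3(databases["GVKBIO"], databases["WOMBAT"], databases["PubChem"])

for pair, shared in matrix.items():
    print(pair, shared)
print(regions)
```

The seven region sizes always sum to the size of the union of the three sets, which gives a quick consistency check on any such comparison.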
