Dealing with the Data Deluge: Handling the Multitude of Chemical Biology Data Sources

Over the last 20 years, the amount and variety of biological and chemical data made publicly available in online databases have exploded. While this means that vast amounts of information can be found online, there is no guarantee that it can be found easily (or at all). A scientist searching for a specific piece of information faces a daunting task: many databases have overlapping content, use their own identifiers and, in some cases, have arcane and unintuitive user interfaces. This overview highlights a variety of well-known data sources for chemical and biological information, focusing on those most useful for chemical biology research. It then examines the issue of using data from multiple sources and the associated problems, such as identifier disambiguation. Finally, it briefly discusses Tripod, a recently developed platform that supports the integration of arbitrary data sources, providing users with a simple interface to search across a federated collection of resources. Curr. Protoc. Chem. Biol. 4:193-209 © 2012 by John Wiley & Sons, Inc.
