Data integration: challenges for drug discovery

The effective integration of data and knowledge from many disparate sources will be crucial to future drug discovery. Data integration is a key element of conducting scientific investigations with modern platform technologies, managing increasingly complex discovery portfolios and processes, and fully realizing economies of scale in large enterprises. However, viewing data integration as simply an 'IT problem' underestimates the novel and serious scientific and management challenges it embodies — challenges that could require significant methodological and even cultural changes in our approach to data.
