Information Retrieval in Life Sciences: A Programmatic Survey

Biomedical databases are a major source of knowledge for research in the life sciences. This knowledge is spread across a network of thousands of databases, repositories and ontologies, which differ substantially in data granularity, storage formats, database systems, supported data models and interfaces. Because the query methods and front ends are so numerous and heterogeneous, making full use of the available data requires considerable bioinformatics skill. Moreover, manually inspecting database entries and citations is time-consuming, so methods from computer science should be applied to this task. Concepts and algorithms from information retrieval (IR) play a central role in meeting these challenges. Although originally developed to manage and query less structured data, IR techniques are becoming increasingly important for integrating life science data repositories and their associated information. This chapter provides an overview of IR concepts and their current applications in the life sciences. Supported by numerous selected references to further literature, the following sections build up a practical guide for biologists and bioinformaticians.
