Challenges in integrating Escherichia coli molecular biology data

One key challenge in Systems Biology is to provide mechanisms to collect and integrate the data needed to meet multiple analysis requirements. Typically, biological contents are scattered over multiple data sources and there is no easy way of comparing heterogeneous data contents. This work discusses ongoing standardisation and interoperability efforts and exposes integration challenges for the model organism Escherichia coli K-12. The goal is to analyse the major obstacles faced by integration processes, suggest ways to identify them systematically and, whenever possible, propose solutions or means to assist manual curation. Integration of gene, protein and compound data was evaluated by comparing the contents of EcoCyc, KEGG, BRENDA, ChEBI, Entrez Gene and UniProt. Three kinds of integration element supported the comparisons: cross-links, a number of standard nomenclatures, and name information. Only in the gene integration scenario did a single integration element perform well enough to support the process by itself. Indeed, the integration of both enzyme and compound records requires considerable curation. The results showed that, even for a well-studied model organism, source contents are still far from being as standardised as would be desired, and metadata vary considerably from source to source. Before designing any data integration pipeline, researchers should decide which sources best fit the purpose of the analysis and be aware of existing conflicts and inconsistencies so that they can intervene in their resolution. Moreover, they should be aware of the limits of automatic integration so that they can define the extent of manual curation necessary for each application.
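The kind of comparison described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the `Record` type, its field names, the `normalise` rule and the identifiers in the usage note are all hypothetical, chosen only to show how cross-links and name information might be tried in turn, with everything else left for manual curation.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """A simplified source record; all field names here are hypothetical."""
    source: str
    record_id: str
    names: set = field(default_factory=set)
    xrefs: set = field(default_factory=set)  # (source, identifier) pairs


def normalise(name: str) -> str:
    """Lower-case a name and keep only letters and digits."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def compare(a_records, b_records):
    """Pair records by cross-link first, then by normalised name.

    Returns (matches, unmatched): matches is a list of
    (a_id, b_id, evidence) tuples; unmatched lists the ids from
    a_records that are left for manual curation.
    """
    by_id = {(b.source, b.record_id): b for b in b_records}
    by_name = {}
    for b in b_records:
        for n in b.names:
            by_name.setdefault(normalise(n), []).append(b)

    matches, unmatched = [], []
    for a in a_records:
        # 1. Explicit cross-links are taken as the strongest evidence.
        linked = [by_id[x] for x in sorted(a.xrefs) if x in by_id]
        if linked:
            matches.extend((a.record_id, b.record_id, "cross-link") for b in linked)
            continue
        # 2. Fall back to names; accept only an unambiguous hit.
        named = {b.record_id for n in a.names for b in by_name.get(normalise(n), [])}
        if len(named) == 1:
            matches.append((a.record_id, named.pop(), "name"))
        else:
            unmatched.append(a.record_id)  # no hit, or several candidates
    return matches, unmatched
```

For example, with invented records `Record("A", "a1", {"D-Glucose"}, {("B", "b1")})` and `Record("A", "a2", {"Pyruvic acid"})` compared against `Record("B", "b1", {"glucose"})` and `Record("B", "b2", {"pyruvic-acid"})`, the first pair is matched through its cross-link and the second through its normalised name. A record whose name has no counterpart, or several, ends up in the unmatched list, which corresponds to the part of the work the abstract assigns to manual curation.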
