Enhanced bioinformatics data modeling concepts and their use in querying and integration

In bioinformatics research, scientists routinely face two problems: modeling complex data types and integrating diverse resources. Traditional data models such as EER lack the expressive power to capture many characteristics that are common in bioinformatics data. We first propose extensions to the ER model that allow accurate representation of many of these characteristics. We then use these concepts in an integrative system that gives biologists an easy-to-use interface for constructing queries. Specifically, we use the enhanced conceptual modeling concepts to build a prototype mediator for querying multiple data sources. The relationships between biological entities are represented semantically as domain ontologies stored in the mediator, so that experts can analyze and correlate the integrated query results.

The following research has been conducted: (1) We propose new EER schema notation to represent commonly occurring biological concepts: the ordering properties of DNA sequences, the 3D structure of proteins, and the functional processes of metabolic pathways. (2) We use these new relationships to develop the mediated domain ontology, which guides the interface design and the query processor implementation of our mediator system. Our mediated schema is based on a hybrid of taxonomy ontologies (core concepts and external classification/annotation concepts) for interpreting raw data sets (protein and gene sequences) in the context of molecular interactions, biochemical pathways, and biological processes. We adopt the RDF data model to implement the mediation data.

Our mediator mainly takes a browsing-based approach to integrating different data sources; extra data can be retrieved dynamically through web services. By browsing the ontology tree in the query interface, users select concepts of interest and their associated attributes to formulate queries based on their domain knowledge. The query result is a set of database entry accessions with associated attribute values. Users can follow the link for each accession to see a detailed report, or cross-compare attributes of these data instances. Query usability and performance are evaluated on real data sets from UniProt [30], ENZYME [8], CATH [23], and GO [29].
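
The browsing-based query style described above can be illustrated with a minimal sketch. It is not the mediator's implementation: it assumes the mediated data is held as plain (subject, predicate, object) triples in the spirit of RDF, and every concept name, attribute name, and accession (`Protein`, `Enzyme`, `ec_class`, `ACC1`, ...) as well as the helper functions `descendants` and `query` are hypothetical placeholders chosen for illustration.

```python
from collections import defaultdict

# Toy RDF-style triples (subject, predicate, object).
# All identifiers are illustrative placeholders, not real database entries.
TRIPLES = [
    ("Enzyme", "subClassOf", "Protein"),
    ("Kinase", "subClassOf", "Enzyme"),
    ("ACC1", "type", "Protein"),
    ("ACC1", "name", "protein-one"),
    ("ACC2", "type", "Kinase"),
    ("ACC2", "name", "kinase-two"),
    ("ACC2", "ec_class", "transferase"),
    ("ACC3", "type", "Gene"),
    ("ACC3", "name", "gene-three"),
]


def descendants(triples, concept):
    """Return the concept plus every concept below it in the subClassOf tree."""
    children = defaultdict(set)
    for s, p, o in triples:
        if p == "subClassOf":
            children[o].add(s)
    found, stack = set(), [concept]
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(children[current])
    return found


def query(triples, concept, attributes):
    """Map each accession typed under `concept` to its selected attribute values."""
    concepts = descendants(triples, concept)
    instances = {s for s, p, o in triples if p == "type" and o in concepts}
    result = {acc: {} for acc in instances}
    for s, p, o in triples:
        if s in instances and p in attributes:
            result[s][p] = o
    return result


# A user picks the "Enzyme" node in the ontology tree and two attributes.
results = query(TRIPLES, "Enzyme", ["name", "ec_class"])
```

Selecting a concept also selects its subconcepts, so the result is keyed by accession with only the requested attributes attached, mirroring the "accessions with associated attribute values" result shape described in the abstract.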

[1]  S Schwartz,et al.  A database of experimental results on globin gene expression. , 1998, Genomics.

[2]  Ramez Elmasri,et al.  Extending EER Modeling Concepts for Biological Data , 2006, 19th IEEE Symposium on Computer-Based Medical Systems (CBMS'06).

[3]  Tim J. P. Hubbard,et al.  Data growth and its impact on the SCOP database: new developments , 2007, Nucleic Acids Res..

[4]  Daniel L. Hartl,et al.  GeneMerge - Post-genomic Analysis, Data Mining, and Hypothesis Testing , 2003, Bioinform..

[5]  Philip E. Bourne,et al.  The Protein Data Bank and lessons in data management , 2004, Briefings Bioinform..

[6]  Ramez Elmasri,et al.  Multi-level Conceptual Modeling for Biomedical Data and Ontologies Integration , 2007, Twentieth IEEE International Symposium on Computer-Based Medical Systems (CBMS'07).

[7]  Ramez Elmasri,et al.  Modelling concepts and database implementation techniques for complex biological data , 2007, Int. J. Bioinform. Res. Appl..

[8]  Carole A. Goble,et al.  Transparent access to multiple bioinformatics information sources , 2001, IBM Syst. J..

[9]  Felix Naumann,et al.  A Data Model and Query Language to Explore Enhanced Links and Paths in Life Science Sources , 2005, WebDB.

[10]  Kai-Uwe Sattler,et al.  Concept-based querying in mediator systems , 2005, The VLDB Journal.

[11]  Kei-Hoi Cheung,et al.  YeastHub: a semantic web use case for integrating data in the life sciences domain , 2005, ISMB.

[12]  David Botstein,et al.  SOURCE: a unified genomic resource of functional annotations, ontologies, and gene expression data , 2003, Nucleic Acids Res..

[13]  Heiner Stuckenschmidt,et al.  Index structures and algorithms for querying distributed RDF repositories , 2004, WWW '04.

[14]  Michael Darsow,et al.  ChEBI: a database and ontology for chemical entities of biological interest , 2007, Nucleic Acids Res..

[15]  Ramez Elmasri,et al.  Fundamentals of Database Systems , 1989 .

[16]  Ramez Elmasri,et al.  Multi-level biomedical ontology-enabled service broker for web-based interoperation , 2008, SAC '08.

[17]  Antje Chang,et al.  BRENDA, AMENDA and FRENDA the enzyme information system: new content and tools in 2009 , 2008, Nucleic Acids Res..

[18]  Subbarao Kambhampati,et al.  Integration of biological sources: current systems and challenges ahead , 2004, SGMD.

[19]  J. Mcentyre,et al.  Linking up with Entrez. , 1998, Trends in genetics : TIG.

[20]  Luciano Milanesi,et al.  Web services and workflow management for biological resources , 2005, BMC Bioinformatics.

[21]  Amos Bairoch,et al.  The ENZYME database in 2000 , 2000, Nucleic Acids Res..

[22]  Ramez Elmasri,et al.  BioSO : Bioinformatic Service Ontology for Dynamic Biomedical Web Services Integration , 2008 .

[23]  Giorgio Valle,et al.  The Gene Ontology project in 2008 , 2007, Nucleic Acids Res..

[24]  Adam J. Smith,et al.  The Database of Interacting Proteins: 2004 update , 2004, Nucleic Acids Res..

[25]  Gultekin Özsoyoglu,et al.  Pathways Database System: An Integrated System for Biological Pathways , 2003, Bioinform..

[26]  Lincoln Stein,et al.  Reactome: a knowledgebase of biological pathways , 2004, Nucleic Acids Res..

[27]  I-Min A Chen,et al.  An Overview of the Object-Protocol Model (OPM) and OPM Data Management Tools , 1995, Inf. Syst..

[28]  Ian M. Donaldson,et al.  The Biomolecular Interaction Network Database and related tools 2005 update , 2004, Nucleic Acids Res..

[29]  Wang Chiew Tan,et al.  An annotation management system for relational databases , 2004, The VLDB Journal.

[30]  Margaret Gardiner-Garden,et al.  A Comparison of Microarray Databases , 2001, Briefings Bioinform..

[31]  Lennart Martens,et al.  PRIDE: new developments and new datasets , 2007, Nucleic Acids Res..

[32]  Xuan Zhang,et al.  A Tool for Supporting Integration Across Multiple Flat-File Datasets , 2006, Sixth IEEE Symposium on BioInformatics and BioEngineering (BIBE'06).

[33]  Yoshihiro Yamanishi,et al.  KEGG for linking genomes to life and the environment , 2007, Nucleic Acids Res..

[34]  William C Reinhold,et al.  MatchMiner: a tool for batch navigation among gene and gene product identifiers , 2003, Genome Biology.

[35]  Philip S. Yu,et al.  Substructure similarity search in graph databases , 2005, SIGMOD '05.

[36]  Paul F. Bugni,et al.  A knowledgebase system to enhance scientific discovery: Telemakus , 2004, Biomedical digital libraries.

[37]  Floris Geerts,et al.  MONDRIAN: Annotating and Querying Databases through Colors and Blocks , 2006, 22nd International Conference on Data Engineering (ICDE'06).

[38]  Patrick Lambrix,et al.  Ontology-based integration for bioinformatics , 2005 .

[39]  Christopher Dubay,et al.  BioQuery: An Object Framework for Building Queries to Biomedical Databases , 2003, Bioinform..

[40]  Evelyn Camon,et al.  The EMBL Nucleotide Sequence Database , 2000, Nucleic Acids Res..

[41]  P D Karp,et al.  Pathway Databases: A Case Study in Computational Symbolic Theories , 2001, Science.

[42]  Andrew C. R. Martin  Mapping PDB chains to UniProtKB entries , 2005 .

[43]  Zukang Feng,et al.  The Protein Data Bank and structural genomics , 2003, Nucleic Acids Res..

[44]  Robert D. Finn,et al.  Pfam: clans, web tools and services , 2005, Nucleic Acids Res..

[45]  M C Peitsch,et al.  Protein modelling for all. , 1999, Trends in biochemical sciences.

[46]  Peer Kröger,et al.  A Computational Biology Database Digest: Data, Data Analysis, and Data Management , 2004, Distributed and Parallel Databases.

[47]  Shamkant B. Navathe,et al.  MITOMAP: a human mitochondrial genome database--1998 update , 1998, Nucleic Acids Res..

[48]  A. Rector,et al.  Relations in biomedical ontologies , 2005, Genome Biology.

[49]  Olivier Bodenreider,et al.  Semantic webs for life sciences. , 2006, Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing.

[50]  Chong Su,et al.  Bacteriome.org—an integrated protein interaction database for E. coli , 2007, Nucleic Acids Res..

[51]  W. Kabsch,et al.  Dictionary of protein secondary structure: Pattern recognition of hydrogen‐bonded and geometrical features , 1983, Biopolymers.

[52]  Ioannis Xenarios,et al.  DIP: The Database of Interacting Proteins: 2001 update , 2001, Nucleic Acids Res..

[53]  Robert S. Ledley,et al.  The Protein Information Resource , 2003, Nucleic Acids Res..

[54]  L. Stein Creating a bioinformatics nation , 2002, Nature.

[55]  Peter Mork,et al.  The Multiple Roles of Ontologies in the BioMediator Data Integration System , 2005, DILS.

[56]  R. Durbin,et al.  The Sequence Ontology: a tool for the unification of genome annotations , 2005, Genome Biology.

[57]  Martin Vingron,et al.  IntAct: an open source molecular interaction database , 2004, Nucleic Acids Res..

[58]  Ewan Birney,et al.  Biological database design and implementation , 2004, Briefings Bioinform..

[59]  H. V. Jagadish,et al.  Biological Data Management: Research, Practice and Opportunities , 2004, VLDB.

[60]  Michael Y. Galperin The Molecular Biology Database Collection: 2008 update , 2007, Nucleic Acids Res..

[61]  Rolf Apweiler,et al.  The EBI SRS Server: Recent Developments , 2002, German Conference on Bioinformatics.

[62]  Maurizio Lenzerini,et al.  Data integration: a theoretical perspective , 2002, PODS.

[63]  Sheng Zhong,et al.  Towards ubiquitous bio-information computing: data protocols, middleware, and Web services for heterogeneous biological information integration and retrieval , 2004, Proceedings. Fourth IEEE Symposium on Bioinformatics and Bioengineering.

[64]  Dennis B. Troup,et al.  NCBI GEO: mining tens of millions of expression profiles—database and tools update , 2006, Nucleic Acids Res..

[65]  Peter A. C. 't Hoen,et al.  Microarray retriever: a web-based tool for searching and large scale retrieval of public microarray data , 2008, Nucleic Acids Res..

[66]  Omran Bukhres,et al.  Complex life science multidatabase queries , 2002 .

[67]  Haruki Nakamura,et al.  Announcing the worldwide Protein Data Bank , 2003, Nature Structural Biology.

[68]  S. Wodak,et al.  Representing and Analysing Molecular and Cellular Function Using the Computer , 2000, Biological chemistry.

[69]  Markus Schneider,et al.  Going Back to Our Database Roots for Managing Genomic Data , 2003, OMICS.

[70]  Purvesh Khatri,et al.  Babel's tower revisited: a universal resource for cross-referencing across annotation databases , 2006, Bioinform..

[71]  Stefan Decker,et al.  A Scalable Framework for the Interoperation of Information Sources , 2001, SWWS.

[72]  Carole A. Goble,et al.  Conceptual modelling of genomic information , 2000, Bioinform..

[73]  Joann J. Ordille,et al.  Data integration: the teenage years , 2006, VLDB.

[74]  Calton Pu,et al.  Querying multiple bioinformatics information sources: can semantic web research help? , 2002, SGMD.

[75]  E. Birney,et al.  The International Protein Index: An integrated database for proteomics experiments , 2004, Proteomics.

[76]  John V. Carlis,et al.  Genomic data modeling , 2003, Inf. Syst..

[77]  Wei Wei,et al.  Modeling the Semantics of 3D Protein Structures , 2004, ER.

[78]  Norman W. Paton,et al.  Conceptual data modelling for bioinformatics , 2002, Briefings Bioinform..

[79]  Amarnath Gupta,et al.  BiologicalNetworks: visualization and analysis tool for systems biology , 2006, Nucleic Acids Res..

[80]  J. Patel,et al.  Declarative Querying for Biological Sequences , 2006, 22nd International Conference on Data Engineering (ICDE'06).

[81]  Alon Y. Halevy,et al.  Answering queries using views: A survey , 2001, The VLDB Journal.

[82]  Christoph W. Sensen,et al.  Semantic Web Service provision: a realistic framework for Bioinformatics programmers , 2007, Bioinform..

[83]  Ramez Elmasri,et al.  Incorporating concepts for bioinformatics data modeling into EER models , 2005, The 3rd ACS/IEEE International Conference on Computer Systems and Applications, 2005..

[84]  Hideaki Sugawara,et al.  DDBJ in collaboration with mass-sequencing teams on annotation , 2004, Nucleic Acids Res..

[85]  C E Lipscomb,et al.  Medical Subject Headings (MeSH). , 2000, Bulletin of the Medical Library Association.

[86]  Antje Chang,et al.  BRENDA, AMENDA and FRENDA: the enzyme information system in 2007 , 2007, Nucleic Acids Res..

[87]  Omran A. Bukhres,et al.  On the Integration of a Large Number of Life Science Web Databases , 2004, DILS.

[88]  Kei-Hoi Cheung,et al.  Advancing translational research with the Semantic Web , 2007, BMC Bioinformatics.