XML-based approaches for the integration of heterogeneous bio-molecular data

Background: Today's public database infrastructure spans a very large collection of heterogeneous biological data, opening new opportunities for molecular biology, biomedical and bioinformatics research, but also raising new problems for their integration and computational processing.

Results: In this paper we survey the most interesting and novel approaches for the representation, integration and management of different kinds of biological data that exploit XML and its related recommendations. We also present new, cutting-edge approaches for managing heterogeneous biological data represented in XML.

Conclusion: XML has succeeded in the integration of heterogeneous biomolecular information and has established itself as the syntactic glue for biological data sources. Nevertheless, a large variety of XML-based data formats has been proposed, which makes the effective integration of bioinformatics data schemes difficult. The adoption of a few semantically rich standard formats is urgently needed to achieve a seamless integration of current biological resources.
