Information Integration and Knowledge Acquisition from Semantically Heterogeneous Biological Data Sources

We present INDUS (Intelligent Data Understanding System), a federated, query-centric system for knowledge acquisition from autonomous, distributed, semantically heterogeneous data sources that can be viewed conceptually as tables. INDUS employs ontologies and inter-ontology mappings to enable a user or an application to view a collection of such data sources, regardless of their location, internal structure, and query interfaces, as though they were a collection of tables structured according to an ontology supplied by the user. This allows INDUS to answer user queries against distributed, semantically heterogeneous data sources without the need for a centralized data warehouse or a common global ontology. We used the INDUS framework to design algorithms for learning probabilistic models (e.g., Naive Bayes models) that predict the GO functional classification of a protein from training sequences distributed among the SWISSPROT and MIPS data sources. Mappings such as EC2GO and MIPS2GO were used to resolve the semantic differences between these data sources when answering queries posed by the learning algorithms. Our results show that INDUS can be successfully used for the integrative analysis of data from multiple sources that collaborative discovery in computational biology requires.
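The idea of learning a Naive Bayes model from distributed sources without pooling the raw data can be illustrated with a minimal sketch. It is not the INDUS implementation: the source names, the label mappings (toy stand-ins for mappings such as EC2GO or MIPS2GO), the sequences, and all function names are assumptions made for illustration. Each source translates its local labels into the user's ontology and returns only counts, which are then summed.

```python
import math
from collections import Counter

# Hypothetical mappings from each source's labels to the user's ontology.
# These entries are toy placeholders, not real EC2GO or MIPS2GO content.
SOURCE_A_TO_USER = {"a_kinase": "kinase", "a_ligase": "ligase"}
SOURCE_B_TO_USER = {"b_01": "kinase", "b_02": "ligase"}

def local_counts(records, mapping):
    """Count classes and (class, residue) pairs at one source, in user-ontology terms."""
    class_counts, feature_counts = Counter(), Counter()
    for sequence, local_label in records:
        label = mapping[local_label]  # resolve the semantic difference here
        class_counts[label] += 1
        for residue in sequence:
            feature_counts[(label, residue)] += 1
    return class_counts, feature_counts

def merge_counts(per_source):
    """Sum the counts returned by every source; no raw records are moved."""
    class_total, feature_total = Counter(), Counter()
    for class_counts, feature_counts in per_source:
        class_total.update(class_counts)
        feature_total.update(feature_counts)
    return class_total, feature_total

def predict(sequence, class_total, feature_total, vocabulary):
    """Multinomial Naive Bayes prediction with add-one (Laplace) smoothing."""
    n = sum(class_total.values())
    best_label, best_score = None, -math.inf
    for label, count in class_total.items():
        denom = sum(feature_total[(label, v)] for v in vocabulary) + len(vocabulary)
        score = math.log(count / n)
        for residue in sequence:
            score += math.log((feature_total[(label, residue)] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy training data held by two hypothetical sources.
source_a = [("KKKA", "a_kinase"), ("LLLA", "a_ligase")]
source_b = [("KKAK", "b_01"), ("LALL", "b_02")]

class_total, feature_total = merge_counts([
    local_counts(source_a, SOURCE_A_TO_USER),
    local_counts(source_b, SOURCE_B_TO_USER),
])
vocabulary = sorted({residue for _, residue in feature_total})
print(predict("KKL", class_total, feature_total, vocabulary))
```

Because the model depends on the data only through these counts, the merged result is the same as if both sources had first been loaded into one table under the user's ontology.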
