Scientific data integration: wrapping textual documents with a database view mechanism and an XML engine

Building a digital library for scientific data requires accessing and manipulating data extracted from flat files or from documents retrieved from the World Wide Web. We present an approach to querying flat files as well as Web data sources through an object database view based on a database system and a wrapper. Generally, a wrapper has two tasks: first, it sends a query to the source to retrieve data; second, it builds the expected output with respect to the virtual structure. Scientific data servers, in particular those publicly available on the Web, usually provide information retrieval techniques to access data. Our wrappers perform these two tasks with a retrieval component, based on an intermediate object view mechanism called 'search views' that maps the source capabilities to attributes, and an XML engine. Although the retrieval component is specific to each data source, this approach shows that the extraction component (the XML engine) can be common. We describe our system and focus on the retrieval component of the Object-Web Wrapper (OWW) for Web sources. The originality of our approach lies in (1) a common wrapper architecture for flat files and Web data sources that share an XML engine for data extraction, (2) a generic view mechanism to access data sources with limited capabilities, and (3) the representation of hyperlinks as abstract attributes in the object view, as well as their use in the search view. Our approach has been developed and demonstrated as part of a multidatabase system supporting queries via uniform Object Protocol Model (OPM) interfaces.
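
The two-task wrapper described above can be illustrated with a minimal sketch. It is not the OWW implementation: the names `SearchView`, `xml_engine`, `Wrapper`, and `toy_retrieve`, the in-memory documents, and the assumption that retrieved documents are already available as XML are all hypothetical and chosen only for illustration. The sketch shows a source-specific retrieval step, driven by a search view that maps object attributes to the fields the source can search on, followed by an extraction step that is shared across sources.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class SearchView:
    """Hypothetical search view: maps object attributes to source search fields."""
    attribute_to_field: Dict[str, str]

    def translate(self, conditions: Dict[str, str]) -> Dict[str, str]:
        # A source with limited capabilities only accepts some attributes.
        unsupported = set(conditions) - set(self.attribute_to_field)
        if unsupported:
            raise ValueError(f"source cannot search on: {sorted(unsupported)}")
        return {self.attribute_to_field[a]: v for a, v in conditions.items()}


def xml_engine(document: str, paths: Dict[str, str]) -> Dict[str, str]:
    """Shared extraction step: read attribute values from one XML document."""
    root = ET.fromstring(document)
    return {attr: root.findtext(path) for attr, path in paths.items()}


@dataclass
class Wrapper:
    search_view: SearchView
    retrieve: Callable[[Dict[str, str]], List[str]]  # source-specific part
    paths: Dict[str, str]                            # attribute -> XML path

    def query(self, conditions: Dict[str, str]) -> List[Dict[str, str]]:
        source_query = self.search_view.translate(conditions)      # task 1
        documents = self.retrieve(source_query)
        return [xml_engine(doc, self.paths) for doc in documents]  # task 2


# Toy in-memory source standing in for a flat file or Web data server.
DOCS = [
    "<entry><id>P1</id><name>kinase</name><organism>human</organism></entry>",
    "<entry><id>P2</id><name>lyase</name><organism>yeast</organism></entry>",
]


def toy_retrieve(source_query: Dict[str, str]) -> List[str]:
    hits = []
    for doc in DOCS:
        root = ET.fromstring(doc)
        if all(root.findtext(f) == v for f, v in source_query.items()):
            hits.append(doc)
    return hits


wrapper = Wrapper(
    search_view=SearchView({"species": "organism"}),
    retrieve=toy_retrieve,
    paths={"accession": "id", "name": "name", "species": "organism"},
)
results = wrapper.query({"species": "human"})
print(results)
```

In this sketch only `toy_retrieve` and the search view mapping would change from one source to another, while `xml_engine` is reused unchanged, which mirrors the split between a specific retrieval component and a common extraction component.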
