Semistructured and structured data in the Web: going back and forth

Database systems offer efficient and reliable technology to query structured data. However, because of the explosion of the World Wide Web [11], an increasing amount of information is stored in repositories organized according to less rigid structures, usually as hypertextual documents, and data access is based on browsing and information retrieval techniques. Since browsing and search engines present important limitations [8], several query languages [19, 20, 23] for the Web have been recently proposed. These approaches are mainly based on a loose notion of structure, and tend to see the Web as a huge collection of unstructured objects, organized as a graph. Clearly, traditional database techniques are of little use in this field, and new techniques need to be developed. In this paper, we present the approach to the management of Web data as attacked in the ArtANEUS project carried out by the database group at Universith di l=toma Tre. Our approach is based on a generalization of the notion of view to the Web framework. In fact, in traditional databases, views represent an essential tool for restructuring and integrating da ta to be presented to the user. Since the Web is becoming a major computing platform and a uniform interface for sharing data, we believe that also in this field a sophisticate view mechanism is needed, with novel features due to the semi-structured nature of the Web. First, in this context, restructuring and presenting da ta under different perspectives requires the generation of derived Web hypertexts, in order to re-organize and re-use portions of the Web. To do this, da ta from existing Web sites must be extracted, and then queried and integrated in order to build new hypertexts, i.e., hypertextual views over the original sites; these manipulations can be better attained in a more structured framework, in which traditional database technology can be leveraged to analyze and correlate information. Therefore, there seem to be different view levels in this framework: (i) at the first level, da ta are extracted from the sites of interest and given a database structure, which represents a first structured view over the original semi-structured data; (ii) then, further database views can be built by means of reorganizations and integrations based on traditional database techniques; (iii) finally, a derived hypertext can be generated offering an alternative or integrated hypertextual view over the original sites. In the process, data go from a loosely structured organizat ion-the Web pages-to a very structured onethe database--and then again to Web structures.

[1]  Peter M. G. Apers Identifying Internet-related Database Research , 1994, East/West Database Workshop.

[2]  Dan Suciu,et al.  A query language and optimization techniques for unstructured data , 1996, SIGMOD '96.

[3]  David Jordan,et al.  The Object Database Standard: ODMG 2.0 , 1997 .

[4]  David Konopnicki,et al.  W3QS: A Query System for the World-Wide Web , 1995, VLDB.

[5]  Won Kim,et al.  Modern Database Systems: The Object Model, Interoperability, and Beyond , 1995, Modern Database Systems.

[6]  Laks V. S. Lakshmanan,et al.  A declarative language for querying and restructuring the Web , 1996, Proceedings RIDE '96. Sixth International Workshop on Research Issues in Data Engineering.

[7]  Masatoshi Yoshikawa,et al.  ILOG: Declarative Creation and Manipulation of Object Identifiers , 1990, VLDB.

[8]  R. G. G. Cattell,et al.  The Object Database Standard: ODMG-93 , 1993 .

[9]  Guido Moerkotte,et al.  Querying documents in object databases , 1997, International Journal on Digital Libraries.

[10]  Alberto O. Mendelzon,et al.  Querying the World Wide Web , 1997, International Journal on Digital Libraries.

[11]  Paolo Merialdo,et al.  To Weave the Web , 1997, VLDB.

[12]  Alberto O. Mendelzon,et al.  Querying the World Wide Web , 1996, Fourth International Conference on Parallel and Distributed Information Systems.

[13]  Serge Abiteboul,et al.  Querying Semi-Structured Data , 1997, Encyclopedia of Database Systems.

[14]  Jennifer Widom,et al.  The TSIMMIS Project: Integration of Heterogeneous Information Sources , 1994, IPSJ.

[15]  Michael Kifer,et al.  Querying object-oriented databases , 1992, SIGMOD '92.

[16]  Jennifer Widom,et al.  The Lorel query language for semistructured data , 1997, International Journal on Digital Libraries.

[17]  Serge Abiteboul,et al.  Querying and Updating the File , 1993, VLDB.

[18]  Joann J. Ordille,et al.  Querying Heterogeneous Information Sources Using Source Descriptions , 1996, VLDB.