Extracting Logical Schema from the Web

One of the main limitations when accessing the web is the lack of explicit structure, whose presence would help in understanding data semantics. A schema for web data can be constructed at different levels, structuring a single page, a whole site, or a group of sites. Here we present an approach for giving a logical schema to a web site. We first define a model for a single page, in which its content is divided into “logical” sections, i.e. parts of a page that each collect related information. We then introduce a site model in which both physical and logical links among different page sections are represented: physical links are existing hyperlinks, while logical links connect sections containing semantically related information. We show how such links can be found and classified according to their relevance, and how the schema is used in a structure-aware browser to improve both browsing and searching.
