Structuring the Web

The World Wide Web is a very large and rich information source, but it lacks structure, so locating data of interest can be difficult. In particular, a page may be divided into different logical sections of information, and highlighting these sections may improve both browsing and searching. We propose a simple approach to structuring Web pages that introduces the "semantic block" as a finer-grained level at which to categorize information inside a page. We also propose a set of XML tags, added alongside the existing HTML tags, that mark such blocks and allow structured pages to be used with both current browsers and future, structure-aware ones, supporting a gradual migration towards a more structured Web. We apply our technique to several Web sites to determine which semantic blocks are needed, using two simple Java-based tools we developed to add the XML tags and manage the resulting structure. Finally, we consider how a schema can be represented to support better browsing.
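
To make the idea of marking semantic blocks concrete, the following is a minimal sketch in Python, not the Java tools described above. The tag name `semblock`, its `type` attribute, and the block types "navigation" and "content" are assumptions chosen for illustration; the abstract does not specify the actual tag vocabulary. The sketch wraps HTML fragments in the hypothetical tag and then lists the blocks found in a page.

```python
import re
from typing import List, Tuple

# Hypothetical tag name; the abstract does not give the real tag vocabulary.
BLOCK_TAG = "semblock"


def wrap_block(html_fragment: str, block_type: str) -> str:
    """Wrap an HTML fragment in the hypothetical semantic-block tag."""
    return f'<{BLOCK_TAG} type="{block_type}">{html_fragment}</{BLOCK_TAG}>'


def find_blocks(page: str) -> List[Tuple[str, str]]:
    """Return (type, inner HTML) pairs for each non-nested block in the page."""
    pattern = re.compile(
        rf'<{BLOCK_TAG} type="([^"]*)">(.*?)</{BLOCK_TAG}>', re.DOTALL
    )
    return pattern.findall(page)


# Build a small page with two illustrative blocks.
page = (
    "<html><body>"
    + wrap_block("<ul><li><a href='/'>Home</a></li></ul>", "navigation")
    + wrap_block("<p>Main article text.</p>", "content")
    + "</body></html>"
)

for block_type, inner in find_blocks(page):
    print(block_type, "->", inner)
```

Because HTML browsers generally ignore elements they do not recognize and still render the enclosed markup, a page annotated this way would be expected to display as before, while a structure-aware tool could use the extra tags. The regular expression here handles only flat, non-nested blocks; a real tool would more likely use a proper parser.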
