Towards Logical Hypertext Structure

Faced with the retrieval problem posed by the overwhelming number of documents online, researchers have recently pushed the adaptation of text categorization to web units. The aim is to use the categories of websites and pages as an additional retrieval criterion. In this context, the bag-of-words model has been applied alongside HTML tags and link structures. Despite promising results, this adaptation remains within the framework of IR-specific models, since it neglects the content-based structuring inherent in hypertext units. This paper approaches hypertext modelling from the perspective of graph theory. It presents an XML-based format for representing websites as hypergraphs. These hypergraphs are used to shed light on the relation between hypertext structure types and their web-based instances. We emphasize two characteristics of this relation. Realizational ambiguity refers to functionally equivalent manifestations of the same structure type. Polymorphism refers to a single web unit that manifests different structure types. We show that polymorphism is a prevalent characteristic of web-based units by means of a categorization experiment that analyses a corpus of hypergraphs representing the structure and content of pages of conference websites. Against this background, we argue for a revision of text representation models by means of hypergraphs that are sensitive to the manifold structuring of web documents.
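To make the notions of hypergraph and polymorphism more concrete, the following is a minimal sketch and not the XML-based format the paper presents. It assumes that web units are nodes and that each hyperedge groups the units manifesting one structure type. The class name `Hypergraph`, its methods, the page names, and the structure type labels are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Hypergraph:
    """Nodes are web units (e.g. pages); a hyperedge groups any number of nodes."""
    # hyperedge id -> (structure type label, set of member nodes)
    edges: dict = field(default_factory=dict)

    def add_edge(self, edge_id, structure_type, nodes):
        self.edges[edge_id] = (structure_type, frozenset(nodes))

    def types_of(self, node):
        """Return the set of structure types that a node takes part in."""
        return {stype for stype, members in self.edges.values() if node in members}

    def polymorphic_nodes(self):
        """Return the nodes that manifest more than one structure type."""
        types = defaultdict(set)
        for stype, members in self.edges.values():
            for node in members:
                types[node].add(stype)
        return {node for node, stypes in types.items() if len(stypes) > 1}


# Hypothetical conference website: the labels below are illustrative assumptions.
site = Hypergraph()
site.add_edge("e1", "call_for_papers", {"index.html", "cfp.html"})
site.add_edge("e2", "important_dates", {"index.html"})
site.add_edge("e3", "programme", {"programme.html"})
```

In this toy example, `site.polymorphic_nodes()` singles out `index.html`, because that one page takes part in hyperedges of two different structure types. The hyperedge `e1` hints at realizational ambiguity in the same spirit: the same structure type is manifested by more than one unit.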
