Retrieving and organizing web pages by “information unit”

Because the WWW encourages hypertext and hypermedia document authoring (e.g., in HTML or XML), Web authors tend to create documents composed of multiple pages connected by hyperlinks or frames. A Web document may be authored in multiple ways, such as (1) with all information in one physical page, or (2) with a main page and the related information in separate linked pages. Existing Web search engines, however, return only physical pages. In this paper, we introduce the concept of an information unit: a logical Web document consisting of multiple physical pages that is treated as one atomic retrieval unit. We present an algorithm to efficiently retrieve information units. The algorithm performs progressive query processing over a Web index by considering both document semantic similarity and link structure. Experimental results on synthetic graphs and real Web data show the effectiveness and usefulness of the proposed information unit retrieval technique.
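To make the idea of an information unit concrete, the following is a minimal illustrative sketch and not the algorithm from the paper. It assumes an undirected link graph stored as an adjacency dictionary and a per-page set of terms. For each candidate root page it connects the root to the nearest page containing each query term and keeps the smallest resulting group of pages. The names `links`, `page_terms`, and `find_information_unit` are hypothetical, and the shortest-path heuristic stands in for the paper's progressive query processing, which also weighs semantic similarity.

```python
from collections import deque


def bfs_parents(links, root):
    """Return {page: parent} for pages reachable from root, in BFS order."""
    parents = {root: None}
    queue = deque([root])
    while queue:
        page = queue.popleft()
        for nxt in links.get(page, ()):
            if nxt not in parents:
                parents[nxt] = page
                queue.append(nxt)
    return parents


def path_to_root(parents, page):
    """Collect the pages on the BFS path from page back to the root."""
    path = []
    while page is not None:
        path.append(page)
        page = parents[page]
    return path


def find_information_unit(links, page_terms, query):
    """Heuristic: smallest connected page set covering every query term."""
    best = None
    for root in links:
        parents = bfs_parents(links, root)
        unit = set()
        for term in query:
            # Dict order is BFS discovery order, so the first match is a
            # page at minimum link distance from the root.
            hit = next((p for p in parents if term in page_terms.get(p, ())), None)
            if hit is None:
                unit = None  # term unreachable from this root
                break
            unit.update(path_to_root(parents, hit))
        if unit is not None and (best is None or len(unit) < len(best)):
            best = unit
    return best


# Hypothetical site: no single page mentions both "battery" and "price".
links = {
    "index": ["specs", "pricing", "blog"],
    "specs": ["index"],
    "pricing": ["index"],
    "blog": ["index"],
}
page_terms = {
    "index": {"laptop"},
    "specs": {"laptop", "battery"},
    "pricing": {"price"},
    "blog": {"news"},
}
result = find_information_unit(links, page_terms, ["battery", "price"])
print(sorted(result))
```

In this toy site a page-level search for both terms would return nothing, while the sketch returns the main page together with its two linked pages as a single unit.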
