Discovery and Retrieval of Logical Information Units in Web

In ordinary search engines for Web pages, the data unit for query processing is the individual page, and indexes are built for each page from the words appearing in it. In actual Web data, however, a logical document discussing one topic is often organized into a set of pages connected via links that the page author provides as “standard navigation routes.” In such a situation, a conjunctive query with multiple keywords may fail to retrieve an appropriate document if those keywords appear in different pages of that document. The data unit for Web data retrieval should therefore be not a page but a connected subgraph corresponding to one logical document. In this paper, we develop new techniques for discovering and retrieving such logical information units in Web data. As in some previous studies, we adopt minimal subgraph semantics for conjunctive queries. Given a conjunctive query, our approach approximates the information units that include all the given keywords in three steps: (1) we distinguish standard route links from the others, (2) we find minimal subgraphs that are connected via those links and include all the keywords, and (3) we compute a score for each subgraph, based on the locality of the keywords within it, in order to examine whether it is really a logical information unit relevant to the query.

Proc. of Workshop on Organizing Web Space (WOWS’99), in conjunction with ACM DL’99, Berkeley, CA, USA, Aug. 1999.
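The three steps can be illustrated with a small sketch. It is not the method developed in the paper, only a brute-force illustration under stated assumptions: the route-link test (`is_route_link`, which treats a link as a standard route when its target lies in or below the source page's directory), the exhaustive subset enumeration, and the locality score (the reciprocal of the number of pages in the subgraph) are all hypothetical choices made for this example, and route links are treated as undirected when checking connectivity.

```python
from itertools import combinations
from posixpath import dirname


def is_route_link(src, dst):
    # Step 1 (hypothetical heuristic, not from the paper): a link is a
    # "standard route" if its target lies in or below the source's directory.
    return dst.startswith(dirname(src).rstrip("/") + "/")


def connected(nodes, adj):
    # Depth-first search restricted to `nodes`, following route links only.
    nodes = set(nodes)
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        n = stack.pop()
        for m in adj.get(n, ()):
            if m in nodes and m not in seen:
                seen.add(m)
                stack.append(m)
    return seen == nodes


def minimal_subgraphs(pages, links, keywords):
    # Step 2: enumerate page sets by increasing size; keep a set if it is
    # connected via route links, covers all keywords, and contains no
    # smaller set already kept (which makes it minimal).
    adj = {}
    for src, dst in links:
        if is_route_link(src, dst):
            adj.setdefault(src, set()).add(dst)
            adj.setdefault(dst, set()).add(src)
    found = []
    ids = sorted(pages)
    for size in range(1, len(ids) + 1):
        for combo in combinations(ids, size):
            s = frozenset(combo)
            if any(f <= s for f in found):
                continue
            words = set().union(*(pages[p] for p in s))
            if keywords <= words and connected(s, adj):
                found.append(s)
    return found


def score(subgraph):
    # Step 3 (hypothetical locality measure): fewer pages means the
    # keywords sit closer together, so the score is higher.
    return 1.0 / len(subgraph)


# Toy data: page path -> words on that page (all values are made up).
pages = {
    "/doc/index.html": {"web", "retrieval"},
    "/doc/sec1.html": {"subgraph"},
    "/doc/sec2.html": {"keyword"},
    "/other/x.html": {"subgraph"},
}
links = [
    ("/doc/index.html", "/doc/sec1.html"),
    ("/doc/index.html", "/doc/sec2.html"),
    ("/doc/index.html", "/other/x.html"),  # not a route link under the heuristic
]
for unit in minimal_subgraphs(pages, links, {"retrieval", "subgraph"}):
    print(sorted(unit), score(unit))
```

In this toy data no single page contains both keywords, so a page-level conjunctive query would return nothing, while the subgraph formed by `/doc/index.html` and `/doc/sec1.html` is retrieved. The pair `/doc/index.html` and `/other/x.html` also covers both keywords but is rejected, because the link between them is not classified as a standard route. The exhaustive enumeration is exponential in the number of pages and is meant only to make the semantics concrete.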
