Web Page Clustering Using Heuristic Search in the Web Graph

Effective representation of Web search results remains an open problem in the Information Retrieval community. For ambiguous queries, a traditional approach is to organize search results into groups (clusters), one for each meaning of the query. These groups are usually constructed according to the topical similarity of the retrieved documents, but documents can be entirely dissimilar in content and still correspond to the same meaning of the query. To overcome this problem, we exploit the thematic locality of the Web: relevant Web pages are often located close to each other in the Web graph of hyperlinks. We estimate how closely related each pair of retrieved pages is by the length of a path between them in this graph. The path is constructed using multi-agent beam search: each agent starts from one Web page and attempts to meet as many other agents as possible under bounded resources. We test the system on two types of queries: ambiguous English words and people's names. The Web appears to be tightly connected; about 70% of the agents meet one another after only three iterations of exhaustive breadth-first search. However, when heuristics are applied, the search becomes more focused and the results are substantially more accurate. Combined with a content-driven Web page clustering technique, our heuristic search system significantly improves the clustering results.
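
The path-length estimate described above can be illustrated with a minimal sketch. This is not the authors' implementation: the toy graph, the page names, the `score` heuristic, the treatment of hyperlinks as undirected, and the parameters `beam_width` and `max_iters` are all assumptions made for illustration. Each agent expands outward from its own page, keeps only the best-scoring candidates at each iteration, and two agents "meet" when they have reached a common page.

```python
from itertools import combinations


def multi_agent_beam_search(graph, starts, score, beam_width=2, max_iters=3):
    """Return {(a, b): path length} for every pair of agents that meet.

    graph      -- adjacency dict; links are treated as undirected (assumption)
    starts     -- one start page per agent
    score      -- hypothetical heuristic score(agent, page); higher is better
    beam_width -- pages each agent keeps per iteration (the bounded resource)
    """
    # visited[a] maps each page reached by agent a to its depth from a's start.
    visited = {a: {a: 0} for a in starts}
    frontier = {a: [a] for a in starts}
    meetings = {}

    for depth in range(1, max_iters + 1):
        for a in starts:
            candidates = {
                n
                for page in frontier[a]
                for n in graph.get(page, ())
                if n not in visited[a]
            }
            # Beam step: keep the best-scoring pages, ties broken by name.
            beam = sorted(candidates, key=lambda n: (-score(a, n), n))[:beam_width]
            for n in beam:
                visited[a][n] = depth
            frontier[a] = beam

        # Two agents meet when they share a page; the path length is the
        # smallest sum of their depths over all shared pages.
        for a, b in combinations(starts, 2):
            shared = visited[a].keys() & visited[b].keys()
            if shared:
                length = min(visited[a][p] + visited[b][p] for p in shared)
                meetings[(a, b)] = min(length, meetings.get((a, b), length))
    return meetings


# Toy graph (invented): two pages about the car, one about the animal.
graph = {
    "jaguar-car-1": ["autos-hub"],
    "jaguar-car-2": ["autos-hub"],
    "jaguar-cat-1": ["zoo-hub"],
    "autos-hub": ["jaguar-car-1", "jaguar-car-2", "portal"],
    "zoo-hub": ["jaguar-cat-1", "portal"],
    "portal": ["autos-hub", "zoo-hub"],
}
starts = ["jaguar-car-1", "jaguar-car-2", "jaguar-cat-1"]


def uniform(agent, page):
    """No heuristic: every page scores the same."""
    return 0


one_step = multi_agent_beam_search(graph, starts, uniform, max_iters=1)
three_steps = multi_agent_beam_search(graph, starts, uniform, max_iters=3)
```

In this toy setting, a single iteration lets only the two car pages meet (path length 2), while three iterations also connect them to the animal page at path length 4. A shorter path is read as a stronger indication that two pages share a meaning of the query; a real `score` function would replace `uniform` to focus the search.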
