Using navigation data to improve IR functions in the context of web search

As part of delivering content, devices such as proxies and gateways log valuable information about users' activities and navigation patterns on the Web. In this study, we consider how this navigation data can be used to improve Web search. A search session is a query submitted to a search engine together with the set of pages accessed during the corresponding search task. We develop a mixture model for the observed set of search sessions and propose variants of the classical EM algorithm for training it. The model itself yields a type of navigation-based query clustering. By implicitly borrowing strength between related queries, the mixture formulation allows us to identify the "highly relevant" URLs for each query cluster. Next, we explore methods for incorporating existing labeled data (the Yahoo! directory, for example) to speed convergence and help resolve low-traffic clusters. Finally, the mixture formulation also provides a simple, hierarchical display of search results based on the query clusters. We evaluate the effectiveness of our approach using access logs from the outgoing Lucent proxy.
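
To make the idea of fitting a mixture model to search sessions with EM concrete, the following is a minimal sketch and not the model used in the study, whose exact form the abstract does not give. It assumes each session is reduced to the list of URL indices it visited and that each cluster is a multinomial distribution over URLs. The function name `em_mixture`, the smoothing constant `alpha`, and the toy `sessions` data are all hypothetical choices made for illustration.

```python
import math
import random


def em_mixture(sessions, n_urls, n_clusters, n_iter=50, alpha=0.01, seed=0):
    """Fit a mixture of multinomials over URL visits with plain EM.

    sessions: list of lists of URL indices, one list per search session.
    Returns (pi, theta, resp): cluster weights, per-cluster URL
    probabilities, and per-session cluster responsibilities.
    """
    rng = random.Random(seed)
    pi = [1.0 / n_clusters] * n_clusters
    # Random, strictly positive starting point so the clusters differ.
    theta = []
    for _ in range(n_clusters):
        row = [rng.random() + 0.5 for _ in range(n_urls)]
        total = sum(row)
        theta.append([v / total for v in row])

    resp = []
    for _ in range(n_iter):
        # E-step: posterior probability of each cluster for each session,
        # computed in log space and normalised for numerical stability.
        resp = []
        for urls in sessions:
            log_w = [
                math.log(pi[k]) + sum(math.log(theta[k][u]) for u in urls)
                for k in range(n_clusters)
            ]
            m = max(log_w)
            w = [math.exp(v - m) for v in log_w]
            z = sum(w)
            resp.append([v / z for v in w])

        # M-step: re-estimate weights and URL probabilities from the
        # responsibility-weighted counts, with additive smoothing alpha
        # (an assumption made here to keep every probability positive).
        for k in range(n_clusters):
            weight = sum(r[k] for r in resp)
            pi[k] = (weight + alpha) / (len(sessions) + alpha * n_clusters)
            counts = [alpha] * n_urls
            for r, urls in zip(resp, sessions):
                for u in urls:
                    counts[u] += r[k]
            total = sum(counts)
            theta[k] = [c / total for c in counts]
    return pi, theta, resp


# Toy data (hypothetical): three sessions visit URLs 0 and 1,
# three sessions visit URLs 2 and 3.
sessions = [[0, 1, 0], [1, 0], [0, 0, 1], [2, 3], [3, 3, 2], [2, 3, 2]]
pi, theta, resp = em_mixture(sessions, n_urls=4, n_clusters=2)
labels = [max(range(2), key=lambda k: r[k]) for r in resp]
```

In this sketch, the largest entries of `theta[k]` play the role of the "highly relevant" URLs for cluster `k`, and `labels` gives a hard cluster assignment per session. Grouping the sessions' queries by these assignments would give one simple form of navigation-based query clustering.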
