Exploiting site-level information to improve web search

Ranking of Web search results has long since evolved beyond simple bag-of-words retrieval models. Modern search engines routinely employ machine-learned ranking that relies on exogenous relevance signals. Yet the majority of current methods still evaluate each Web page out of context. In this work, we introduce a novel source of relevance information for Web search by evaluating each page in the context of its host Web site. For this purpose, we devise two strategies for compactly representing entire Web sites. We formalize our approach by building two indices: a traditional page index and a new site index, in which each "document" represents an entire Web site. At runtime, a query is executed against both indices, and the final score of a page for that query is produced by combining the score of the page with the score of its site. Experiments on a large-scale Web search test collection from a major commercial search engine confirm that the proposed approach leads to consistent and significant improvements in retrieval effectiveness.
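
The two-index scheme described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract specifies neither the site representation strategies nor the score combination. The sketch assumes that a site "document" is simply the merged term counts of its pages, that the scorer is a toy length-normalized term frequency, and that the two scores are combined by linear interpolation with a hypothetical weight `lam`. All function names are illustrative.

```python
from collections import Counter
from urllib.parse import urlparse


def tokenize(text):
    """Lowercase whitespace tokenizer (a deliberate simplification)."""
    return text.lower().split()


def build_indices(pages):
    """Build a page index and a site index from a {url: text} mapping.

    Assumption: a site "document" is the merged term counts of all pages
    on the same host. The paper's own site representations may differ.
    """
    page_index = {url: Counter(tokenize(text)) for url, text in pages.items()}
    site_index = {}
    for url, counts in page_index.items():
        host = urlparse(url).netloc
        site_index.setdefault(host, Counter()).update(counts)
    return page_index, site_index


def score(query, counts):
    """Toy relevance score: fraction of the document's tokens matching the query."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(counts[term] for term in tokenize(query)) / total


def rank(query, page_index, site_index, lam=0.5):
    """Score every page against both indices and combine the two scores.

    Assumption: linear interpolation, lam * page + (1 - lam) * site.
    """
    results = []
    for url, counts in page_index.items():
        host = urlparse(url).netloc
        combined = (lam * score(query, counts)
                    + (1.0 - lam) * score(query, site_index[host]))
        results.append((url, combined))
    return sorted(results, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    pages = {
        "http://a.example/1": "cheap flights to paris",
        "http://a.example/2": "flights flights deals",
        "http://b.example/1": "flights and cooking recipes",
    }
    page_index, site_index = build_indices(pages)
    for url, combined in rank("flights", page_index, site_index):
        print(f"{combined:.3f}  {url}")
```

In this toy collection, the pages `http://a.example/1` and `http://b.example/1` receive the same page-only score for the query "flights", but the first one ranks higher once the site score is mixed in, because its host site is more focused on the query topic.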
