Term-level search result diversification

Current approaches to search result diversification are categorized as either implicit or explicit. The implicit approach assumes that each document represents its own topic and promotes diversity by selecting documents whose vocabularies differ. The explicit approach, in contrast, models the set of query topics, or aspects, directly. While the former is generally less effective, the latter usually depends on a manually created description of the query aspects, the automatic construction of which has proven difficult. This paper introduces a new approach: term-level diversification. Instead of modeling the set of query aspects, which are typically represented as coherent groups of terms, our approach uses the terms without grouping them. Our results on the ClueWeb collection show that grouping topic terms provides very little benefit to diversification compared to simply using the terms themselves. Consequently, we demonstrate that term-level diversification, with topic terms identified automatically from the search results using a simple greedy algorithm, significantly outperforms methods that attempt to create a full topic structure for diversification.
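To make the idea of diversifying over ungrouped terms concrete, the following is a minimal sketch and not the method evaluated in the paper. It assumes an xQuAD-style greedy reranker in which every topic term is treated as its own aspect. The function name `diversify_by_terms`, the parameters `term_weights` and `lam`, and the toy scores are all hypothetical, and the topic terms are taken as given rather than identified from the search results.

```python
from typing import Dict, List


def diversify_by_terms(
    relevance: Dict[str, float],
    term_coverage: Dict[str, Dict[str, float]],
    term_weights: Dict[str, float],
    k: int,
    lam: float = 0.5,
) -> List[str]:
    """Greedy reranking where each topic term acts as a separate aspect.

    relevance:     doc id -> relevance score in [0, 1] (assumed input)
    term_coverage: doc id -> {term: degree in [0, 1] to which the doc covers it}
    term_weights:  term -> importance of the term (assumed to sum to 1)
    lam:           trade-off between relevance (0) and diversity (1)
    """
    # not_covered[t]: how much of term t is still unsatisfied by the picks so far
    not_covered = {t: 1.0 for t in term_weights}
    remaining = set(relevance)
    ranking: List[str] = []

    while remaining and len(ranking) < k:

        def score(doc: str) -> float:
            cov = term_coverage.get(doc, {})
            # Reward documents that cover terms not yet covered by the ranking.
            diversity = sum(
                w * cov.get(t, 0.0) * not_covered[t]
                for t, w in term_weights.items()
            )
            return (1.0 - lam) * relevance[doc] + lam * diversity

        # sorted() makes tie-breaking deterministic (by document id).
        best = max(sorted(remaining), key=score)
        ranking.append(best)
        remaining.remove(best)

        # Discount every term by how well the chosen document covers it.
        for t in term_weights:
            not_covered[t] *= 1.0 - term_coverage.get(best, {}).get(t, 0.0)

    return ranking


# Toy data (hypothetical): two documents about one term, one about another.
relevance = {"d1": 0.9, "d2": 0.85, "d3": 0.5}
term_coverage = {
    "d1": {"car": 1.0},
    "d2": {"car": 1.0},
    "d3": {"animal": 1.0},
}
term_weights = {"car": 0.5, "animal": 0.5}

ranking = diversify_by_terms(relevance, term_coverage, term_weights, k=3)
print(ranking)
```

In this toy run the less relevant document `d3` is promoted above `d2`, because `d1` has already covered the term `car` while `animal` is still uncovered. No grouping of terms into aspects is needed for that effect, which is the intuition behind the term-level approach.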
