Result diversification based on query-specific cluster ranking

Result diversification is a retrieval strategy for dealing with ambiguous or multi-faceted queries by providing documents that cover as many facets of the query as possible. We propose a result diversification framework based on query-specific clustering and cluster ranking, in which diversification is restricted to documents belonging to clusters that potentially contain a high percentage of relevant documents. Empirical results show that the proposed framework improves the performance of several existing diversification methods. The framework also gives rise to a simple yet effective cluster-based approach to result diversification that selects documents from different clusters to be included in a ranked list in a round robin fashion. We describe a set of experiments aimed at thoroughly analyzing the behavior of the two main components of the proposed diversification framework, ranking and selecting clusters for diversification. Both components have a crucial impact on the overall performance of our framework, but ranking clusters plays a more important role than selecting clusters. We also examine properties that clusters should have in order for our diversification framework to be effective. Most relevant documents should be contained in a small number of high-quality clusters, while there should be no dominantly large clusters. Also, documents from these high-quality clusters should have a diverse content. These properties are strongly correlated with the overall performance of the proposed diversification framework. © 2011 Wiley Periodicals, Inc.

[1]  Birger Larsen,et al.  Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, Seattle, Washington, USA, August 6-11, 2006 , 2006 .

[2]  P. Sneath,et al.  Numerical Taxonomy , 1962, Nature.

[3]  W. Bruce Croft,et al.  A Markov random field model for term dependencies , 2005, SIGIR '05.

[4]  Oren Kurland,et al.  Inter-Document Similiarities, Language Models, and Ad Hoc Information Retrieval , 2006 .

[5]  Bert R. Boyce,et al.  Beyond topicality : A two stage view of relevance and the retrieval process , 1982, Inf. Process. Manag..

[6]  W. Bruce Croft A model of cluster searching bases on classification , 1980, Inf. Syst..

[7]  W. Bruce Croft,et al.  Evaluating Text Representations for Retrieval of the Best Group of Documents , 2008, ECIR.

[8]  Donald Geman,et al.  Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images , 1984 .

[9]  Ben Carterette,et al.  Probabilistic models of ranking novel documents for faceted topic retrieval , 2009, CIKM.

[10]  Robert Villa,et al.  The effectiveness of query-specific hierarchic clustering in information retrieval , 2002, Inf. Process. Manag..

[11]  Charles L. A. Clarke,et al.  Novelty and diversity in information retrieval evaluation , 2008, SIGIR '08.

[12]  James Allan,et al.  Using part-of-speech patterns to reduce query ambiguity , 2002, SIGIR '02.

[13]  David R. Karger,et al.  Less is More Probabilistic Models for Retrieving Fewer Relevant Documents , 2006 .

[14]  W. Bruce Croft,et al.  Cluster-based retrieval using language models , 2004, SIGIR '04.

[15]  M. de Rijke,et al.  An effective coherence measure to determine topical consistency in user-generated content , 2009, International Journal on Document Analysis and Recognition (IJDAR).

[16]  Jade Goldstein-Stewart,et al.  The use of MMR, diversity-based reranking for reordering documents and producing summaries , 1998, SIGIR '98.

[17]  S. Robertson The probability ranking principle in IR , 1997 .

[18]  Oren Kurland,et al.  Re-ranking search results using language models of query-specific clusters , 2009, Information Retrieval.

[19]  W. Bruce Croft,et al.  Representing clusters for retrieval , 2006, SIGIR.

[20]  Marti A. Hearst,et al.  Reexamining the cluster hypothesis: scatter/gather on retrieval results , 1996, SIGIR '96.

[21]  Thorsten Joachims,et al.  Predicting diverse subsets using structural SVMs , 2008, ICML '08.

[22]  Filip Radlinski,et al.  Learning diverse rankings with multi-armed bandits , 2008, ICML '08.

[23]  Donna K. Harman,et al.  Overview of the Third Text REtrieval Conference (TREC-3) , 1995, TREC.

[24]  C. J. van Rijsbergen,et al.  The use of hierarchic clustering in information retrieval , 1971, Inf. Storage Retr..

[25]  C. J. van Rijsbergen,et al.  Information Retrieval , 1979, Encyclopedia of GIS.

[26]  John D. Lafferty,et al.  A risk minimization framework for information retrieval , 2006, Inf. Process. Manag..

[27]  Guodong Zhou,et al.  Document re-ranking using cluster validation and label propagation , 2006, CIKM '06.

[28]  Sreenivas Gollapudi,et al.  Diversifying search results , 2009, WSDM '09.

[29]  Peter Willett,et al.  Recent trends in hierarchic document clustering: A critical review , 1988, Inf. Process. Manag..

[30]  Craig MacDonald,et al.  Exploiting query reformulations for web search result diversification , 2010, WWW '10.

[31]  Claudio Carpineto,et al.  Mobile information retrieval with search results clustering: Prototypes and evaluations , 2009 .

[32]  Carmel Domshlak,et al.  A rank-aggregation approach to searching for optimal query-specific clusters , 2008, SIGIR '08.

[33]  Mark Steyvers,et al.  Finding scientific topics , 2004, Proceedings of the National Academy of Sciences of the United States of America.

[34]  William Goffman,et al.  A searching procedure for information retrieval , 1964, Inf. Storage Retr..

[35]  Michael I. Jordan,et al.  Latent Dirichlet Allocation , 2001, J. Mach. Learn. Res..

[36]  Alexander Dekhtyar,et al.  Information Retrieval , 2018, Lecture Notes in Computer Science.