Evaluating subtopic retrieval methods: Clustering versus diversification of search results

To address the inability of current ranking systems to support subtopic retrieval, two main techniques for post-processing search results have been investigated: clustering and diversification. In this paper we present a comparative study of their performance, using a set of complementary evaluation measures that can be applied to both partitions and ranked lists, and two specialized test collections focusing on broad and ambiguous queries, respectively. The main finding of our experiments is that diversification of top hits is more useful for quick coverage of distinct subtopics, whereas clustering is better for full retrieval of single subtopics; a better balance in performance is achieved by generating multiple subsets of diverse search results. We also found that there is little scope for improvement over the search engine baseline unless we are interested in strict full-subtopic retrieval, and that search results clustering methods do not perform well on queries with low-divergence subtopics, mainly due to the difficulty of generating discriminative cluster labels.
