The cluster hypothesis in information retrieval

The cluster hypothesis states that "closely associated documents tend to be relevant to the same requests" [9]. This is one of the most fundamental and influential hypotheses in information retrieval (IR) and has given rise to a large body of work. In this tutorial we present the research topics that have emerged from the cluster hypothesis. We focus specifically on cluster-based document retrieval, the use of topic models for ad hoc IR, and graph-based methods that utilize inter-document similarities. Furthermore, we provide an in-depth survey of the retrieval methods that rely, either explicitly or implicitly, on the cluster hypothesis and that are used for a variety of tasks, e.g., query expansion, query-performance prediction, fusion and federated search, and search results diversification.

1 Tutorial Objectives

The primary objective of this tutorial is to present the cluster hypothesis and the lines of research to which it has given rise. To this end, much emphasis is put on fundamental retrieval techniques and principles that are based on the cluster hypothesis and that have been used for a variety of IR tasks. The more specific goals of the tutorial are to provide attendees with (i) the background required to pursue research on topics that are based on the cluster hypothesis; (ii) an overview of the different tasks for which the cluster hypothesis can be leveraged; and (iii) fundamental knowledge of the retrieval "toolkit" that was developed based on the cluster hypothesis.
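To make the hypothesis quoted at the start of this abstract concrete, the following is a minimal sketch of a nearest-neighbor style check: for each relevant document, it measures what fraction of its k most similar documents are also relevant. The toy corpus, the relevance judgments, the bag-of-words cosine similarity, and the function names `cosine` and `nn_relevant_fraction` are all illustrative assumptions, not taken from any of the cited works.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def nn_relevant_fraction(docs, relevant, k=2):
    """Average, over relevant documents, of the fraction of their
    k nearest neighbors (by cosine similarity) that are also relevant."""
    vecs = {d: Counter(text.lower().split()) for d, text in docs.items()}
    fractions = []
    for d in sorted(relevant):
        others = [o for o in vecs if o != d]
        # Most similar documents first.
        others.sort(key=lambda o: cosine(vecs[d], vecs[o]), reverse=True)
        top = others[:k]
        fractions.append(sum(o in relevant for o in top) / len(top))
    return sum(fractions) / len(fractions)


# Hypothetical toy corpus and relevance judgments for a single query.
docs = {
    "d1": "cluster based document retrieval",
    "d2": "document cluster retrieval models",
    "d3": "retrieval with document cluster ranking",
    "d4": "cooking pasta with tomato sauce",
    "d5": "tomato sauce recipes",
}
relevant = {"d1", "d2", "d3"}

score = nn_relevant_fraction(docs, relevant, k=2)
print(f"mean fraction of relevant nearest neighbors: {score:.2f}")
```

A value close to 1 suggests that, for this query and similarity measure, relevant documents sit near one another, which is the situation the hypothesis describes. A value close to the overall proportion of relevant documents would suggest that the similarity measure carries little information about relevance.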

[1]  Oren Kurland,et al.  From "Identical" to "Similar": Fusing Retrieved Lists Based on Inter-document Similarities , 2009, ICTIR.

[2]  Anton Leuski,et al.  Evaluating document clustering for interactive information retrieval , 2001, CIKM '01.

[3]  Oren Kurland,et al.  A study of the integration of passage-, document-, and cluster-based information for re-ranking search results , 2011, Information Retrieval.

[4]  M. de Rijke,et al.  Result diversification based on query-specific cluster ranking , 2011, J. Assoc. Inf. Sci. Technol.

[5]  W. Bruce Croft,et al.  Evaluating Text Representations for Retrieval of the Best Group of Documents , 2008, ECIR.

[6]  James Allan,et al.  Evaluating a Visual Navigation System for a Digital Library , 1998, ECDL.

[7]  W. Bruce Croft,et al.  Geometric representations for multiple documents , 2010, SIGIR.

[8]  C. J. van Rijsbergen,et al.  The use of hierarchic clustering in information retrieval , 1971, Inf. Storage Retr..

[9]  C. J. van Rijsbergen.  Information Retrieval , 1979, Butterworths.

[10]  Peter Willett.  Query-specific automatic document classification , 1985 .

[11]  Oren Kurland,et al.  Respect my authority!: HITS without hyperlinks, utilizing cluster-based language models , 2006, SIGIR.

[12]  James Allan,et al.  Evaluating topic models for information retrieval , 2008, CIKM '08.

[13]  Robert Villa,et al.  The effectiveness of query-specific hierarchic clustering in information retrieval , 2002, Inf. Process. Manag..

[14]  Amit Singhal,et al.  Document expansion for speech retrieval , 1999, SIGIR '99.

[15]  Peter Willett,et al.  Techniques for the measurement of clustering tendency in document retrieval systems , 1987, J. Inf. Sci..

[16]  Guodong Zhou,et al.  Document re-ranking using cluster validation and label propagation , 2006, CIKM '06.

[17]  Fernando D. Diaz.  A method for transferring retrieval scores between collections with non-overlapping vocabularies , 2008, SIGIR '08.

[18]  L. Azzopardi,et al.  Topic based language models for ad hoc information retrieval , 2004, IEEE International Joint Conference on Neural Networks.

[19]  Key-Sun Choi,et al.  Re-ranking model based on document clusters , 2001, Inf. Process. Manag..

[20]  Oren Kurland,et al.  Cluster-based query expansion , 2009, SIGIR.

[21]  Tao Tao,et al.  Language Model Information Retrieval with Document Expansion , 2006, NAACL.

[22]  Peter Willett,et al.  Hierarchic document classification using Ward's clustering method , 1986, SIGIR '86.

[23]  Peter Willett,et al.  Hierarchic Document Clustering Using Ward's Method , 1986, SIGIR 1986.

[24]  W. Bruce Croft.  A model of cluster searching based on classification , 1980, Inf. Syst..

[25]  Carmel Domshlak,et al.  A rank-aggregation approach to searching for optimal query-specific clusters , 2008, SIGIR '08.

[26]  Fernando Diaz,et al.  Regularizing ad hoc retrieval scores , 2005, CIKM '05.

[27]  Oren Kurland,et al.  Corpus structure, language models, and ad hoc information retrieval , 2004, SIGIR '04.

[28]  Xiaojin Zhu,et al.  Improving Diversity in Ranking using Absorbing Random Walks , 2007, NAACL.

[29]  C. Danilowicz,et al.  Document ranking based upon Markov chains , 2001 .

[30]  Oren Kurland,et al.  Query-performance prediction and cluster ranking: two sides of the same coin , 2012, CIKM.

[31]  Benno Stein,et al.  The optimum clustering framework: implementing the cluster hypothesis , 2011, Information Retrieval.

[32]  Oren Kurland,et al.  Utilizing inter-passage and inter-document similarities for reranking search results , 2010, ACM Trans. Inf. Syst..

[33]  C. J. van Rijsbergen,et al.  Query-Sensitive Similarity Measures for Information Retrieval , 2003, Knowledge and Information Systems.

[34]  Oren Kurland,et al.  Re-ranking search results using language models of query-specific clusters , 2009, Information Retrieval.

[35]  Oren Kurland,et al.  Utilizing inter-document similarities in federated search , 2012, SIGIR '12.

[36]  Oren Kurland,et al.  Cluster-based fusion of retrieved lists , 2011, SIGIR.

[37]  Oren Kurland,et al.  PageRank without hyperlinks: structural re-ranking using links induced by language models , 2005, SIGIR '05.

[38]  W. Bruce Croft,et al.  Cluster-based retrieval using language models , 2004, SIGIR '04.

[39]  James Allan,et al.  A New Measure of the Cluster Hypothesis , 2009, ICTIR.

[40]  Ellen M. Voorhees.  The cluster hypothesis revisited , 1985, SIGIR 1985.

[41]  Fernando Diaz,et al.  Performance prediction using spatial autocorrelation , 2007, SIGIR.

[42]  W. Bruce Croft,et al.  LDA-based document models for ad-hoc retrieval , 2006, SIGIR.

[43]  Oren Kurland,et al.  Exploring the cluster hypothesis, and cluster-based retrieval, over the web , 2012, CIKM '12.

[44]  James Allan,et al.  A cluster-based resampling method for pseudo-relevance feedback , 2008, SIGIR '08.

[45]  Ingemar J. Cox,et al.  On ranking the effectiveness of searches , 2006, SIGIR.

[46]  Czeslaw Danilowicz,et al.  Re-ranking method based on inter-document distances , 2005, Inf. Process. Manag..

[47]  Oren Kurland,et al.  The opposite of smoothing: a language model approach to ranking query-specific document clusters , 2008, SIGIR '08.

[48]  Oren Kurland,et al.  Re-ranking search results using document-passage graphs , 2008, SIGIR '08.

[49]  David R. Karger,et al.  Scatter/Gather as a Tool for the Navigation of Retrieval Results , 1995 .

[50]  Oren Kurland,et al.  Ranking document clusters using markov random fields , 2013, SIGIR.