Cluster-based retrieval using language models

Previous research on cluster-based retrieval has been inconclusive as to whether it improves retrieval effectiveness over document-based retrieval. Recent developments in the language modeling approach to IR have motivated us to re-examine this problem within this new retrieval framework. We propose two new models for cluster-based retrieval and evaluate them on several TREC collections. We show that cluster-based retrieval can perform consistently across collections of realistic size, and that significant improvements over document-based retrieval can be obtained fully automatically, without relevance information provided by humans.
