Clusters, language models, and ad hoc information retrieval

The language-modeling approach to information retrieval provides an effective statistical framework for tackling various problems and often achieves impressive empirical performance. However, most previous work on language models for information retrieval focused on document-specific characteristics, and therefore did not take into account the structure of the surrounding corpus, a potentially rich source of additional information. We propose a novel algorithmic framework in which information provided by document-based language models is enhanced by the incorporation of information drawn from clusters of similar documents. Using this framework, we develop a suite of new algorithms. Even the simplest typically outperforms the standard language-modeling approach in terms of mean average precision (MAP) and recall, and our new interpolation algorithm posts statistically significant performance improvements for both metrics over all six corpora tested. An important aspect of our work is the way we model corpus structure. In contrast to most previous work on cluster-based retrieval that partitions the corpus, we demonstrate the effectiveness of a simple strategy based on a nearest-neighbors approach that produces overlapping clusters.
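
To make the idea concrete, here is a minimal sketch of how a document's query-likelihood score might be interpolated with scores from overlapping nearest-neighbor clusters. It is not the paper's algorithm. Cosine similarity over term counts, additive smoothing, averaging over the clusters that contain a document, the parameters `k`, `lam`, and `alpha`, and all function names are assumptions made for illustration.

```python
import math
from collections import Counter

def unigram_lm(tokens, vocab, alpha=0.1):
    # Additively smoothed unigram model; the smoothing scheme is an assumption.
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}

def query_likelihood(lm, query):
    # Probability the model assigns to the query, treating terms as independent.
    # Query terms outside the vocabulary are skipped.
    return math.prod(lm[w] for w in query if w in lm)

def cosine(a, b):
    # Cosine similarity between two token lists, using raw term counts.
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = (math.sqrt(sum(v * v for v in ca.values()))
            * math.sqrt(sum(v * v for v in cb.values())))
    return dot / norm if norm else 0.0

def knn_clusters(docs, k):
    # One cluster per document: the document plus its k most similar neighbors.
    # A document can belong to several clusters, so the clusters overlap.
    clusters = []
    for i, d in enumerate(docs):
        others = sorted((j for j in range(len(docs)) if j != i),
                        key=lambda j: cosine(d, docs[j]), reverse=True)
        clusters.append([i] + others[:k])
    return clusters

def interpolated_scores(docs, query, k=1, lam=0.5):
    # lam weights the document model; (1 - lam) weights the cluster evidence.
    vocab = {w for d in docs for w in d}
    clusters = knn_clusters(docs, k)
    cluster_lms = [unigram_lm([w for j in c for w in docs[j]], vocab)
                   for c in clusters]
    scores = []
    for i, d in enumerate(docs):
        doc_score = query_likelihood(unigram_lm(d, vocab), query)
        # Every document belongs at least to its own cluster.
        member_of = [ci for ci, c in enumerate(clusters) if i in c]
        cluster_score = sum(query_likelihood(cluster_lms[ci], query)
                            for ci in member_of) / len(member_of)
        scores.append(lam * doc_score + (1 - lam) * cluster_score)
    return scores

docs = [
    ["cluster", "retrieval", "model"],
    ["cluster", "language", "model"],
    ["cooking", "pasta", "recipe"],
]
query = ["language", "model"]
scores = interpolated_scores(docs, query, k=1, lam=0.5)
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(ranking)
```

In this toy corpus the first document never mentions "language", yet it shares a cluster with the second document, so the cluster term raises its score above what its own language model gives. Setting `lam=1.0` recovers plain document-based scoring.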
