Autocorrelation and regularization of query-based information retrieval scores

Query-based information retrieval is the process of scoring documents given a short natural language query. Query-based retrieval systems have been developed to support searching diverse collections such as the World Wide Web, personal email archives, news corpora, and legal collections. This thesis is motivated by one of the tenets of information retrieval: the cluster hypothesis. Based on the cluster hypothesis, we define a design principle stating that retrieval scores should be locally consistent, that is, closely related documents should receive similar scores. We refer to this design principle as score autocorrelation. Our experiments show that the degree to which retrieval scores satisfy this design principle correlates positively with system performance. We use this result to define a general, black-box method for improving the local consistency of a set of retrieval scores. We refer to this process as local score regularization. We demonstrate that regularization consistently and significantly improves retrieval performance for a wide variety of baseline algorithms. Regularization is closely related to classic techniques such as pseudo-relevance feedback and cluster-based retrieval. We demonstrate that the effectiveness of these techniques may be explained by their regularizing behavior. We argue that regularization should be adopted either as a generic post-processing step or as a fundamental design principle for retrieval models.
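The two ideas above, measuring how locally consistent a set of scores is and then smoothing the scores toward their neighbors, can be illustrated with a minimal sketch. It is not the formulation used in the thesis. It assumes Moran's I as the autocorrelation measure and a simple iterative neighbor-averaging scheme as the regularizer. The function names `morans_i` and `regularize`, the parameter `alpha`, and the toy similarity matrix are all hypothetical.

```python
def morans_i(scores, weights):
    """Moran's I coefficient of `scores` over a document similarity graph.

    `weights[i][j]` is the (assumed symmetric, non-negative) similarity
    between documents i and j. Values near +1 mean that similar documents
    receive similar scores; negative values mean neighbors disagree.
    """
    n = len(scores)
    mean = sum(scores) / n
    dev = [s - mean for s in scores]
    total_weight = sum(sum(row) for row in weights)
    cross = sum(weights[i][j] * dev[i] * dev[j]
                for i in range(n) for j in range(n))
    variance = sum(d * d for d in dev)
    return (n / total_weight) * (cross / variance)


def regularize(scores, weights, alpha=0.5, iterations=50):
    """Smooth scores toward the weighted average of their neighbors.

    Each pass mixes the original score (weight 1 - alpha) with the
    similarity-weighted mean of the neighbors' current scores (weight alpha).
    This is an illustrative stand-in for local score regularization; it
    treats the baseline scores as a black box.
    """
    n = len(scores)
    row_sums = [sum(row) for row in weights]
    smoothed = list(scores)
    for _ in range(iterations):
        updated = []
        for i in range(n):
            if row_sums[i] == 0:
                # A document with no neighbors keeps its original score.
                updated.append(scores[i])
                continue
            neighbor_mean = sum(weights[i][j] * smoothed[j]
                                for j in range(n)) / row_sums[i]
            updated.append((1 - alpha) * scores[i] + alpha * neighbor_mean)
        smoothed = updated
    return smoothed


# Toy example: documents 0 and 1 are near-duplicates, as are 2 and 3,
# yet the baseline scores disagree within each pair.
weights = [
    [0, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 0],
]
scores = [0.9, 0.3, 0.2, 0.6]

before = morans_i(scores, weights)
smoothed = regularize(scores, weights)
after = morans_i(smoothed, weights)
print(f"autocorrelation before: {before:.3f}, after: {after:.3f}")
```

In this toy case the scores within each pair of similar documents are pulled toward each other, so the autocorrelation coefficient increases after smoothing. Whether a particular choice of graph, `alpha`, or iteration count helps retrieval performance is an empirical question that this sketch does not address.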
