Luhn Revisited: Significant Words Language Models

Users tend to articulate their complex information needs in only a few keywords, making underspecified statements of request the main bottleneck for retrieval effectiveness. Taking advantage of feedback information is one of the best ways to enrich the query representation, but it can also lead to loss of query focus and harm performance, in particular when the initial query retrieves little relevant information and the feedback model overfits to accidental features of the particular observed feedback documents. Inspired by the early work of Luhn [23], we propose significant words language models of feedback documents that capture all, and only, the significant shared terms from feedback documents. We adjust the weights of common terms that are already well explained by the document collection, as well as the weights of rare terms that are only explained by specific feedback documents, which eventually leaves only the significant terms in the feedback model. Our main contributions are the following. First, we present significant words language models as effective models that capture the essential terms and their probabilities. Second, we apply the resulting models to the relevance feedback task and observe better performance than state-of-the-art methods. Third, we see that the estimation method is remarkably robust, making the models insensitive to noisy non-relevant terms in feedback documents. Our general observation is that significant words language models capture relevance more accurately by excluding general terms and feedback-document-specific terms.
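The weight adjustment described in the abstract can be illustrated with a small mixture-model sketch. This is not the paper's estimation procedure, which the abstract does not spell out; it is a minimal illustration under stated assumptions: each term occurrence in a feedback document is treated as drawn from a significant-words model, a general (collection) model, or a document-specific model, and only the significant-words model is re-estimated with EM. The function names, the fixed mixing weights (`lam_sw`, `lam_g`, `lam_s`), the add-one smoothing, and the form of the specific model are all hypothetical choices made for this example.

```python
from collections import Counter


def specific_model(doc_probs, i):
    """Assumed specific model: favors terms frequent in document i but rare in the others."""
    scores = {}
    for term, p in doc_probs[i].items():
        score = p
        for j, other in enumerate(doc_probs):
            if j != i:
                score *= 1.0 - other.get(term, 0.0)
        scores[term] = score
    z = sum(scores.values())
    if z == 0.0:  # degenerate case: fall back to the document's own distribution
        return dict(doc_probs[i])
    return {term: s / z for term, s in scores.items()}


def estimate_swlm(feedback_docs, collection_counts,
                  lam_sw=0.2, lam_g=0.5, lam_s=0.3, iters=50):
    """EM sketch with fixed (assumed) mixing weights; returns p(term | significant words)."""
    counts = [Counter(doc) for doc in feedback_docs]
    doc_probs = [{t: c / sum(cnt.values()) for t, c in cnt.items()} for cnt in counts]
    vocab = set().union(*counts)

    # General model: add-one smoothed collection frequencies (an assumption).
    all_terms = vocab | set(collection_counts)
    total = sum(collection_counts.values()) + len(all_terms)
    p_g = {t: (collection_counts.get(t, 0) + 1) / total for t in vocab}

    # One specific model per feedback document.
    p_s = [specific_model(doc_probs, i) for i in range(len(counts))]

    # Initialize the significant-words model with pooled relative frequencies.
    pooled = Counter()
    for cnt in counts:
        pooled.update(cnt)
    n = sum(pooled.values())
    p_sw = {t: pooled[t] / n for t in vocab}

    for _ in range(iters):
        expected = dict.fromkeys(vocab, 0.0)
        for cnt, ps in zip(counts, p_s):
            for t, c in cnt.items():
                # E-step: share of this term's occurrences credited to the
                # significant-words component rather than general or specific.
                sw = lam_sw * p_sw[t]
                expected[t] += c * sw / (sw + lam_g * p_g[t] + lam_s * ps[t])
        # M-step: renormalize the expected counts.
        z = sum(expected.values())
        p_sw = {t: e / z for t, e in expected.items()}
    return p_sw


# Toy data, invented for illustration only.
docs = [
    ["the", "the", "the", "retrieval", "feedback", "luhn"],
    ["the", "the", "the", "retrieval", "feedback", "okapi"],
    ["the", "the", "the", "retrieval", "feedback", "rocchio"],
]
collection = Counter({"the": 1000, "of": 500, "retrieval": 5, "feedback": 5,
                      "luhn": 1, "okapi": 1, "rocchio": 1})
model = estimate_swlm(docs, collection)
for term, prob in sorted(model.items(), key=lambda kv: -kv[1]):
    print(f"{term}\t{prob:.3f}")
```

In this toy setting the shared topical terms end up with more probability mass than the very common term, and the terms that occur in only one document are pushed toward zero, which mirrors the "all, and only, the significant shared terms" goal stated above. How strongly this happens depends on the assumed mixing weights.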

[1]  M. de Rijke,et al.  Parsimonious relevance models , 2008, SIGIR '08.

[2]  Allan Hanbury,et al.  Generalizing Translation Models in the Probabilistic Relevance Framework , 2016, CIKM.

[3]  Craig MacDonald,et al.  Expertise drift and query expansion in expert search , 2007, CIKM '07.

[4]  Stephen E. Robertson,et al.  Okapi at TREC-7: Automatic Ad Hoc, Filtering, VLC and Interactive , 1998, TREC.

[5]  Claudio Carpineto,et al.  A Survey of Automatic Query Expansion in Information Retrieval , 2012, CSUR.

[6]  Jaap Kamps,et al.  The Healing Power of Poison: Helpful Non-relevant Documents in Feedback , 2016, CIKM.

[7]  Hans Peter Luhn,et al.  The Automatic Creation of Literature Abstracts , 1958, IBM J. Res. Dev.

[8]  ChengXiang Zhai,et al.  Revisiting the Divergence Minimization Feedback Model , 2014, CIKM.

[9]  ChengXiang Zhai,et al.  A comparative study of methods for estimating query language models with pseudo feedback , 2009, CIKM.

[10]  Donna K. Harman,et al.  Relevance Feedback and Other Query Modification Techniques , 1992, Information retrieval (Boston).

[11]  John D. Lafferty,et al.  Model-based feedback in the language modeling approach to information retrieval , 2001, CIKM '01.

[12]  W. Bruce Croft,et al.  Embedding-based Query Language Models , 2016, ICTIR.

[13]  Tao Tao,et al.  A two-stage mixture model for pseudo feedback , 2004, SIGIR '04.

[14]  Azadeh Shakery,et al.  Pseudo-Relevance Feedback Based on Matrix Factorization , 2016, CIKM.

[15]  John D. Lafferty,et al.  Document Language Models, Query Models, and Risk Minimization for Information Retrieval , 2001, SIGIR Forum.

[16]  Kevyn Collins-Thompson,et al.  Estimation and use of uncertainty in pseudo-relevance feedback , 2007, SIGIR.

[17]  C. J. van Rijsbergen,et al.  The selection of good search terms , 1981, Inf. Process. Manag.

[18]  Kevyn Collins-Thompson,et al.  Reducing the risk of query expansion via robust constrained optimization , 2009, CIKM.

[19]  Azadeh Shakery,et al.  Axiomatic Analysis for Improving the Log-Logistic Feedback Model , 2016, SIGIR.

[20]  Djoerd Hiemstra,et al.  The Impact of Positive, Negative and Topical Relevance Feedback , 2008, TREC.

[21]  J. J. Rocchio,et al.  Relevance feedback in information retrieval , 1971.

[22]  Djoerd Hiemstra,et al.  Parsimonious language models for information retrieval , 2004, SIGIR '04.

[23]  Mostafa Dehghani.  Significant Words Representations of Entities , 2016, SIGIR.

[24]  Ellen M. Voorhees,et al.  Overview of the TREC 2004 Robust Retrieval Track , 2004.

[25]  ChengXiang Zhai,et al.  Adaptive relevance feedback in information retrieval , 2009, CIKM.

[26]  Tao Tao,et al.  Regularized estimation of mixture models for robust pseudo-relevance feedback , 2006, SIGIR.

[27]  Ting Liu,et al.  A review of relevance feedback experiments at the 2003 reliable information access (RIA) workshop , 2004, SIGIR '04.

[28]  Maarten Marx,et al.  Generalized Group Profiling for Content Customization , 2016, CHIIR.

[29]  C. Buckley,et al.  Overview of the TREC 2010 Relevance Feedback Track (Notebook) , 2010.

[30]  Donna K. Harman,et al.  Overview of the Reliable Information Access Workshop , 2009, Information Retrieval.

[31]  John D. Lafferty,et al.  A study of smoothing methods for language models applied to Ad Hoc information retrieval , 2001, SIGIR '01.

[32]  Stephen E. Robertson,et al.  Relevance weighting of search terms , 1976, J. Am. Soc. Inf. Sci.

[33]  W. Bruce Croft,et al.  Relevance-Based Language Models , 2001, SIGIR '01.

[34]  Mounia Lalmas,et al.  A survey on the use of relevance feedback for information access systems , 2003, The Knowledge Engineering Review.

[35]  Iadh Ounis,et al.  Studying Query Expansion Effectiveness , 2009, ECIR.

[36]  Egidio L. Terra,et al.  Poison pills: harmful relevant documents in feedback , 2005, CIKM '05.

[37]  Chris Buckley,et al.  Relevance Feedback Track Overview: TREC 2008 , 2008, TREC.

[38]  Leif Azzopardi,et al.  An analysis on document length retrieval trends in language modeling smoothing , 2008, Information Retrieval.

[39]  Maarten Marx,et al.  On Horizontal and Vertical Separation in Hierarchical Text Classification , 2016, ICTIR.

[40]  Fernando Diaz,et al.  UMass at TREC 2004: Novelty and HARD , 2004, TREC.

[41]  Maarten Marx,et al.  Two-Way Parsimonious Classification Models for Evolving Hierarchies , 2016, CLEF.

[42]  Jaap Kamps,et al.  Parsimonious User and Group Profiling in Venue Recommendation , 2015, TREC.

[43]  C. J. van Rijsbergen,et al.  A New Theoretical Framework for Information Retrieval , 1986, SIGIR Forum.

[44]  Djoerd Hiemstra,et al.  Parsimonious Language Models for a Terabyte of Text , 2007, TREC.

[45]  Iadh Ounis,et al.  Finding good feedback documents , 2009, CIKM.