Exploiting External Collections for Query Expansion

A persistent challenge in information retrieval is the vocabulary mismatch between a user’s information need and the relevant documents. One way of addressing this issue is query modeling: adding terms to the original query and reweighting its terms. In social media, where documents often contain creative and noisy language (e.g., spelling and grammatical errors), query modeling proves difficult. To address this, earlier work has drawn on external sources for query modeling, with apparent success. In this article we propose a general generative query expansion model that uses external document collections for term generation: the External Expansion Model (EEM). The main rationale behind our model is the hypothesis that each query requires its own mixture of external collections for expansion and that an expansion model should account for this. For some queries we expect, for example, a news collection to be most beneficial, while for other queries we could benefit more from selecting terms from a general encyclopedia. EEM allows for query-dependent weighting of the external collections. We test our model on the task of blog post retrieval, using four external collections in our experiments: (i) a news collection, (ii) a Web collection, (iii) Wikipedia, and (iv) a blog post collection. Experiments show that EEM outperforms query expansion on the individual collections, as well as the Mixture of Relevance Models previously proposed by Diaz and Metzler [2006]. Extensive analysis of the results shows that our naive approach to estimating query-dependent collection importance works reasonably well and that “oracle” settings reveal the full potential of our model. We also find that query-dependent collection importance has more impact on retrieval performance than query-independent collection importance (i.e., a collection prior).
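The idea of query-dependent collection weighting can be illustrated with a minimal sketch. This is not the EEM estimation described in the article: it only mixes per-collection term distributions, written here as P(t | q, c), using query-dependent collection weights P(c | q) that are supplied by hand. The function names, the toy feedback documents, and the weights are all hypothetical.

```python
from collections import Counter


def term_distribution(docs):
    """Maximum-likelihood unigram distribution over whitespace-tokenized docs."""
    counts = Counter(t for d in docs for t in d.split())
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}


def mix_expansion_terms(per_collection_terms, collection_weights, k=5):
    """Mix per-collection term distributions with query-dependent weights.

    per_collection_terms: {collection: {term: P(t | q, c)}}
    collection_weights:   {collection: unnormalized weight for P(c | q)}
    Returns the top-k terms of the mixture, renormalized to sum to 1.
    """
    z = sum(collection_weights.values())
    if z <= 0:
        raise ValueError("collection weights must sum to a positive value")
    mixed = Counter()
    for collection, dist in per_collection_terms.items():
        w = collection_weights.get(collection, 0.0) / z
        for term, p in dist.items():
            mixed[term] += w * p
    top = mixed.most_common(k)
    norm = sum(p for _, p in top)
    return {term: p / norm for term, p in top}


# Toy feedback documents per external collection (hypothetical data).
feedback = {
    "news": ["election vote poll", "election results"],
    "wikipedia": ["election democracy vote"],
}
per_collection = {c: term_distribution(d) for c, d in feedback.items()}

# Assumed weights for one query: news is taken to matter more here.
expanded = mix_expansion_terms(per_collection, {"news": 0.75, "wikipedia": 0.25}, k=3)
print(expanded)
```

Changing the weights per query shifts which collection dominates the expansion terms; a collection prior, by contrast, would fix the same weights for every query.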

[1] Valentin Jijkoun, et al. Generating Focused Topic-Specific Sentiment Lexicons, 2010, ACL.

[2] W. Bruce Croft, et al. A Markov random field model for term dependencies, 2005, SIGIR '05.

[3] Iadh Ounis, et al. The TREC Blogs06 Collection: Creating and Analysing a Blog Test Collection, 2006.

[4] M. de Rijke, et al. A few examples go a long way: constructing query models from elaborate query formulations, 2008, SIGIR '08.

[5] M. de Rijke, et al. The University of Amsterdam at the TREC 2007 Blog Track, 2007.

[6] M. de Rijke, et al. Credibility Improves Topical Blog Post Retrieval, 2008, ACL.

[7] Djoerd Hiemstra, et al. Retrieving Web Pages Using Content, Links, URLs and Anchors, 2001, TREC.

[8] M. de Rijke, et al. The University of Amsterdam at TREC 2008: Blog, Enterprise, and Relevance Feedback, 2008.

[9] Jure Leskovec, et al. Meme-tracking and the dynamics of the news cycle, 2009, KDD.

[10] Tao Tao, et al. Regularized estimation of mixture models for robust pseudo-relevance feedback, 2006, SIGIR.

[11] F. W. Lancaster, et al. Information retrieval systems; characteristics, testing, and evaluation, 1968.

[12] W. Bruce Croft, et al. A framework for selective query expansion, 2004, CIKM '04.

[13] Claudio Carpineto, et al. Query Difficulty, Robustness, and Selective Application of Query Expansion, 2004, ECIR.

[14] Fernando Diaz, et al. Improving the estimation of relevance models using large external corpora, 2006, SIGIR.

[15] Jaime G. Carbonell, et al. Retrieval and feedback models for blog feed search, 2008, SIGIR '08.

[16] Yang Xu, et al. Query dependent pseudo-relevance feedback based on wikipedia, 2009, SIGIR.

[17] Djoerd Hiemstra, et al. Using language models for information retrieval, 2001.

[18] Iadh Ounis, et al. Finding good feedback documents, 2009, CIKM.

[19] Claire Fautsch, et al. UniNE at TREC 2008: Fact and Opinion Retrieval in the Blogsphere, 2008, TREC.

[20] Hosung Park, et al. What is Twitter, a social network or a news media?, 2010, WWW '10.

[21] Ricardo Baeza-Yates, et al. Modern Information Retrieval - the concepts and technology behind search, Second edition, 2011.

[22] Wouter Weerkamp, et al. Finding people and their utterances in social media, 2010, SIGIR.

[23] ChengXiang Zhai, et al. Adaptive relevance feedback in information retrieval, 2009, CIKM.

[24] Iadh Ounis, et al. Overview of the TREC 2008 Blog Track, 2008, TREC.

[25] Jaime G. Carbonell, et al. Retrieval and Feedback Models for Blog Distillation, 2007, TREC.

[26] Zhendong Niu, et al. Concept Based Query Expansion, 2013, 2013 Ninth International Conference on Semantics, Knowledge and Grids.

[27] Stephen E. Robertson, et al. Selecting good expansion terms for pseudo-relevance feedback, 2008, SIGIR '08.

[28] Carmel Domshlak, et al. Better than the real thing?: iterative pseudo-query processing using cluster-based language models, 2005, SIGIR '05.

[29] Maarten de Rijke, et al. Length normalization in XML retrieval, 2004, SIGIR '04.

[30] Milad Shokouhi, et al. Query Expansion Using External Evidence, 2009, ECIR.

[31] Marc-Allen Cartright, et al. UMass Amherst and UT Austin @ the TREC 2009 Relevance Feedback Track, 2009, TREC.

[32] ChengXiang Zhai, et al. Probabilistic Relevance Models Based on Document and Query Generation, 2003.

[33] Rong Yan, et al. Query expansion using probabilistic local feedback with application to multimedia retrieval, 2007, CIKM '07.

[34] Craig MacDonald, et al. Overview of the TREC 2007 Blog Track, 2007, TREC.

[35] Clement Yu, et al. UIC at TREC 2008 Blog Track, 2008.

[36] Maarten de Rijke, et al. A query model based on normalized log-likelihood, 2009, CIKM.

[37] Milad Shokouhi, et al. LambdaMerge: merging the results of query reformulations, 2011, WSDM '11.

[38] Tetsuya Sakai. The use of external text data in cross-language information retrieval based on machine translation, 2002, IEEE International Conference on Systems, Man and Cybernetics.

[39] Craig MacDonald, et al. Overview of the TREC 2006 Blog Track, 2006, TREC.

[40] M. de Rijke, et al. Credibility-inspired ranking for blog post retrieval, 2012, Information Retrieval.

[41] Christopher D. Manning, et al. Introduction to Information Retrieval, 2010, J. Assoc. Inf. Sci. Technol.

[42] Hans-Peter Frei, et al. Concept based query expansion, 1993, SIGIR.

[43] J. J. Rocchio, et al. Relevance feedback in information retrieval, 1971.

[44] Richard M. Schwartz, et al. A hidden Markov model information retrieval system, 1999, SIGIR '99.

[45] Kui-Lam Kwok, et al. TREC-9 Cross Language, Web and Question-Answering Track Experiments using PIRCS, 2000, TREC.

[46] Maarten de Rijke, et al. Exploiting External Collections for Query Expansion, 2012.

[47] Gilad Mishne, et al. Towards recency ranking in web search, 2010, WSDM '10.

[48] Peter Bailey, et al. ACSys TREC-8 Experiments, 1999, TREC.

[49] Wei Zhang, et al. UIC at TREC 2006 Blog Track, 2006, TREC.

[51] Timothy W. Finin, et al. The BlogVox Opinion Retrieval System, 2006, TREC.

[52] Gilad Mishne, et al. A Study of Blog Search, 2006, ECIR.

[53] Jaime G. Carbonell, et al. Document Representation and Query Expansion Models for Blog Recommendation, 2008, ICWSM.

[54] Iadh Ounis, et al. Combining fields for query expansion and adaptive query expansion, 2007, Inf. Process. Manag.