A Generative Blog Post Retrieval Model that Uses Query Expansion based on External Collections

User-generated content is characterized by short, noisy documents with many spelling errors and unexpected language usage. To bridge the vocabulary gap between the user's information need and the documents in one specific user-generated content environment, the blogosphere, we apply a form of query expansion, i.e., adding and reweighting query terms. Because the blogosphere is noisy, query expansion on the collection itself is rarely effective; external, edited collections are more suitable. We propose a generative model for expanding queries using external collections, in which dependencies between queries, documents, and expansion documents are explicitly modeled. We discuss different instantiations of the model, each making different (in)dependence assumptions. Results using two external collections (news and Wikipedia) show that external expansion is effective for retrieval of user-generated content. In addition, conditioning the external collection on the query is very beneficial, and making candidate expansion terms dependent on just the document seems sufficient.
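To make the dependency structure concrete, the following is a minimal sketch of one possible instantiation, not necessarily the one evaluated in the paper. It assumes that expansion terms are scored as P(t|Q) ∝ Σ_c P(c|Q) Σ_{D∈c} P(t|D) P(D|Q,c), so that the collection weight is conditioned on the query and candidate terms depend only on the expansion document. All function names, the toy collections, and the crude `eps` floor (standing in for proper smoothing) are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def unigram(text):
    """Maximum-likelihood unigram model P(t|D); assumes a non-empty document."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {t: n / total for t, n in counts.items()}


def query_likelihood(query, doc_model, eps=1e-6):
    """P(Q|D) as a product of term probabilities, with a floor eps for unseen terms."""
    p = 1.0
    for t in query.lower().split():
        p *= doc_model.get(t, eps)
    return p


def expansion_term_scores(query, collections, eps=1e-6):
    """Score candidate terms by sum_c P(c|Q) * sum_{D in c} P(t|D) * P(D|Q,c).

    Assumption: P(D|Q,c) is the query likelihood normalized within the
    collection, and P(c|Q) is the collection's share of total query likelihood.
    """
    models = {c: [unigram(d) for d in docs] for c, docs in collections.items() if docs}
    likes = {c: [query_likelihood(query, m, eps) for m in ms] for c, ms in models.items()}
    mass = {c: sum(ls) for c, ls in likes.items()}
    total = sum(mass.values())

    scores = defaultdict(float)
    for c, ms in models.items():
        p_c = mass[c] / total                  # P(c|Q): collection conditioned on the query
        for m, like in zip(ms, likes[c]):
            p_d = like / mass[c]               # P(D|Q,c)
            for t, p_t in m.items():           # P(t|D): term depends on the document only
                scores[t] += p_c * p_d * p_t
    return dict(scores)


# Toy external collections (hypothetical data).
external = {
    "news": ["election results announced parliament vote", "football match final score"],
    "wikipedia": ["election is a formal vote process", "football is a team sport"],
}
scores = expansion_term_scores("election vote", external)
top_terms = sorted(scores, key=scores.get, reverse=True)[:5]
```

In this sketch the highest-scoring terms would be added to the original query with weights proportional to their scores. Fixing P(c|Q) to a uniform value instead would correspond to treating the collection as independent of the query.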
