Retrieval and feedback models for blog feed search

Blog feed search poses challenges that differ from those of traditional ad hoc document retrieval: the units of retrieval, the blogs, are themselves collections of documents, the blog posts. In this work we adapt a state-of-the-art federated search model to the feed retrieval task and show a significant improvement over algorithms based on the best-performing submissions to the TREC 2007 Blog Distillation task [12]. We also show that typical query expansion techniques, such as pseudo-relevance feedback using the blog corpus, do not provide any significant performance improvement and in many cases dramatically hurt performance. We analyze in depth the behavior of pseudo-relevance feedback for this task and develop a novel query expansion technique that uses the link structure of Wikipedia. This technique provides significant and consistent performance improvements, yielding gains in MAP of 22% and 14% over the unexpanded query for our baseline and federated algorithms, respectively.

[1] Jaime G. Carbonell, et al. Retrieval and Feedback Models for Blog Distillation, 2007, TREC.

[2] W. Bruce Croft, et al. A Markov random field model for term dependencies, 2005, SIGIR '05.

[3] Iadh Ounis, et al. The TREC Blogs06 Collection: Creating and Analysing a Blog Test Collection, 2006.

[4] Pawan Kumar, et al. Notice of Violation of IEEE Publication Principles The Anatomy of a Large-Scale Hyper Textual Web Search Engine, 2009.

[5] W. Bruce Croft, et al. Indri at TREC 2004: Terabyte Track, 2004, TREC.

[6] Nick Craswell, et al. Overview of the TREC 2006 Enterprise Track, 2006, TREC.

[7] Sergey Brin, et al. The Anatomy of a Large-Scale Hypertextual Web Search Engine, 1998, Comput. Networks.

[8] Jaime G. Carbonell, et al. Document Representation and Query Expansion Models for Blog Recommendation, 2008, ICWSM.

[9] Charles L. A. Clarke, et al. The TREC 2005 Terabyte Track, 2005, TREC.

[10] W. Bruce Croft, et al. UMass at TREC 2008 Blog Distillation Task, 2007, TREC.

[11] Craig MacDonald, et al. Overview of the TREC 2007 Blog Track, 2007, TREC.

[12] Timothy W. Finin, et al. Characterizing the Splogosphere, 2006, WWW 2006.

[13] Jamie Callan, et al. Distributed Information Retrieval, 2002.

[14] Iadh Ounis, et al. University of Glasgow at TREC 2006: Experiments in Terabyte and Enterprise Tracks with Terrier, 2006, TREC.

[15] ChengXiang Zhai, et al. A study of smoothing methods for language models applied to information retrieval, 2004, TOIS.

[16] Charles L. A. Clarke, et al. The TREC 2006 Terabyte Track, 2006, TREC.

[17] Fernando Diaz, et al. Improving the estimation of relevance models using large external corpora, 2006, SIGIR.

[18] Craig MacDonald, et al. University of Glasgow at TREC 2007: Experiments in Blog and Enterprise Tracks with Terrier, 2007, TREC.

[19] W. Bruce Croft, et al. Indri at TREC 2005: Terabyte Track, 2005, TREC.

[20] Charles L. A. Clarke, et al. Overview of the TREC 2004 Terabyte Track, 2004, TREC.

[21] W. Bruce Croft, et al. Relevance-Based Language Models, 2001, SIGIR '01.