Blog feed search with a post index

User-generated content is an important domain for mining knowledge. In this paper, we address the task of blog feed search: finding blogs that are principally devoted to a given topic, as opposed to blogs that merely mention the topic in passing. The large number of blogs makes the blogosphere a challenging domain, in terms of both effectiveness and storage and retrieval efficiency. We examine the effectiveness of an approach to blog feed search that uses individual posts, rather than full blogs, as indexing units. Working within a probabilistic language modeling approach to information retrieval, we model the blog feed search task by aggregating over a blogger’s posts to collect evidence of relevance to the topic and of persistent interest in the topic. This approach achieves state-of-the-art effectiveness. We then introduce a two-stage model in which a pre-selection of candidate blogs is followed by a ranking step. The model integrates aggressive pruning techniques and very lean representations of the contents of blog posts, resulting in substantial gains in efficiency while keeping effectiveness at a very competitive level.
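
As a rough illustration of the post-based idea described above, the following minimal sketch scores each post with a smoothed unigram query likelihood and then sums the post scores per blog. It is not the model from the paper: the Jelinek-Mercer smoothing, the uniform post weight 1/|posts of the blog|, the `top_n` pruning rule, and all function and parameter names are assumptions made for this example.

```python
from collections import Counter, defaultdict


def post_likelihood(query, post_tokens, coll_counts, coll_len, lam=0.5):
    """P(query | post) under a unigram model with Jelinek-Mercer smoothing
    (the smoothing method and lam value are assumptions of this sketch)."""
    counts = Counter(post_tokens)
    n = len(post_tokens)
    prob = 1.0
    for term in query:
        p_post = counts[term] / n if n else 0.0
        p_coll = coll_counts[term] / coll_len
        prob *= (1 - lam) * p_post + lam * p_coll
    return prob


def rank_blogs(query, posts, top_n=None, lam=0.5):
    """Rank blogs for a query from a post-level index.

    posts: list of (blog_id, tokens) pairs, one per post.
    top_n: if set, keep only the top_n best-scoring posts before
           aggregating (a hypothetical pruning rule for illustration).
    """
    coll_counts = Counter(t for _, toks in posts for t in toks)
    coll_len = sum(coll_counts.values())
    posts_per_blog = Counter(blog for blog, _ in posts)

    # Score every post individually against the query.
    scored = [
        (blog, post_likelihood(query, toks, coll_counts, coll_len, lam))
        for blog, toks in posts
    ]

    # Stage 1 (optional): pre-select candidates by pruning low-scoring posts.
    if top_n is not None:
        scored = sorted(scored, key=lambda x: x[1], reverse=True)[:top_n]

    # Stage 2: aggregate post evidence per blog, weighting each post by
    # 1 / (number of posts in the blog), so that a blog must write about
    # the topic repeatedly to score well.
    blog_scores = defaultdict(float)
    for blog, score in scored:
        blog_scores[blog] += score / posts_per_blog[blog]
    return sorted(blog_scores.items(), key=lambda x: x[1], reverse=True)
```

Dividing by the full post count of a blog, even after pruning, is one simple way to reward persistent interest: a blog with a single on-topic post among many off-topic ones receives only a small share of that post's score.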
