Blogger, stick to your story: modeling topical noise in blogs with coherence measures

Topical noise in blogs arises when bloggers digress from the central topical thrust of their blogs. We introduce a method that explicitly incorporates a model of topical noise into a language modeling approach to the task of blog distillation. Topical noise is integrated into the model through a coherence score, which reflects the tightness of the topical structure of a blog. Experiments on the TRECBlog06 corpus show that naively integrating the coherence score as a blog prior fails to improve performance. Instead, we develop a set of more sophisticated models in which the coherence score is weighted by a function of the blog retrieval score. The proposed models help improve the effectiveness of our language modeling approach to the blog distillation task.
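To make the two integration strategies concrete, the following is a minimal sketch and not the models evaluated in the paper. It assumes that coherence is measured as the fraction of post pairs in a blog whose cosine similarity reaches a threshold, that the naive prior multiplies the retrieval score by the coherence score, and that the weighted variant adds the coherence score scaled by a function of the retrieval score. The names `coherence`, `naive_score`, `weighted_score`, and `weight_fn`, the threshold value, and the functional forms are all hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def coherence(posts, threshold=0.5):
    """Assumed coherence measure: share of post pairs that are similar enough.

    The threshold of 0.5 is an arbitrary illustrative value.
    """
    vectors = [Counter(post.lower().split()) for post in posts]
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 0.0  # a blog with fewer than two posts has no pairs to compare
    similar = sum(cosine(a, b) >= threshold for a, b in pairs)
    return similar / len(pairs)


def naive_score(retrieval_score, coherence_score):
    """Assumed naive integration: coherence acts as a multiplicative prior."""
    return retrieval_score * coherence_score


def weighted_score(retrieval_score, coherence_score, weight_fn=lambda r: r):
    """Hypothetical weighted integration.

    The coherence score is scaled by a function of the retrieval score,
    so coherence matters more for blogs that already match the query.
    """
    return retrieval_score + weight_fn(retrieval_score) * coherence_score


focused = ["python code tips", "python code tricks", "python code tips"]
noisy = ["python code tips", "my cat photos", "holiday beach trip"]
print(coherence(focused), coherence(noisy))
```

Under these assumptions, a blog with no coherent post pairs is scored zero by the naive prior regardless of how well it matches the query, whereas the weighted form leaves its retrieval score intact and only rewards coherence.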
