An effective coherence measure to determine topical consistency in user-generated content

When searching for blogs on a specific topic, information seekers prefer blogs that place a central focus on that topic over blogs whose mention of the topic is diffuse or incidental. To present users with better blog feed search results, we developed a measure of topical consistency that captures whether a blog is topically focused. The measure, called the coherence score, is inspired by the genetics literature and captures the tightness of the clustering structure of a data set relative to a background collection. In a set of experiments on synthetic data, the coherence score is shown to faithfully reflect topic clustering structure. We discuss the properties that make the coherence score more appropriate than lexical cohesion, a common measure of topical structure. Retrieval experiments show that integrating the coherence score as a prior in a language modeling-based approach to blog feed search improves retrieval effectiveness. The coherence score must, however, be used judiciously to avoid boosting the ranking of irrelevant but topically focused blogs. To this end, we experiment with a series of weighting schemes that adjust the contribution of the coherence score according to the relevance of a blog to the user query, and find that an appropriate weighting scheme improves retrieval performance. Finally, we show that the coherence score can be reliably estimated from a sample of more than 20 posts. Consistent with this finding, experiments show that the best retrieval performance is achieved when coherence scores are used for blogs that contain more than 20 posts.
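
The abstract does not spell out how the coherence score is computed, so the following is only a minimal sketch of one plausible reading of "tightness of the clustering structure relative to a background collection": the fraction of post pairs in a blog whose similarity reaches a threshold derived from a background collection. The bag-of-words cosine similarity, the `top_fraction` parameter, and all function names are assumptions made for illustration and are not taken from the paper.

```python
import math
from collections import Counter
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def pairwise_similarities(docs):
    """Similarities of all unordered document pairs (whitespace tokens)."""
    vectors = [Counter(d.lower().split()) for d in docs]
    return [cosine(x, y) for x, y in combinations(vectors, 2)]


def threshold_from_background(background_docs, top_fraction=0.05):
    """Assumed rule: the similarity value that bounds the most similar
    `top_fraction` of pairs in the background collection."""
    sims = sorted(pairwise_similarities(background_docs), reverse=True)
    k = max(1, int(len(sims) * top_fraction))
    return sims[k - 1]


def coherence_score(posts, tau):
    """Fraction of post pairs whose similarity is at least tau."""
    sims = pairwise_similarities(posts)
    if not sims:  # fewer than two posts: no pairs to compare
        return 0.0
    return sum(s >= tau for s in sims) / len(sims)


if __name__ == "__main__":
    background = [
        "cats purr softly",
        "stock markets fell today",
        "rain is expected tomorrow",
        "the team won the match",
        "new phone released",
        "cats sleep a lot",
    ]
    tau = threshold_from_background(background)
    focused = ["cats purr when happy", "happy cats purr loudly", "cats purr and sleep"]
    diffuse = ["cats purr when happy", "stock markets fell sharply", "rain ruined the match"]
    print(coherence_score(focused, tau), coherence_score(diffuse, tau))
```

Under these assumptions, a blog whose posts keep returning to the same vocabulary scores near 1, while a blog with unrelated posts scores near 0. With only a handful of posts there are few pairs to count, so the estimate is coarse, which is in line with the abstract's observation that a sample of more than 20 posts is needed for a reliable score.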
