Blog site search using resource selection

A blog site consists of many individual blog postings. Current blog search services focus on retrieving postings, but there is also a need to identify relevant blog sites. Blog site search resembles resource selection in distributed information retrieval, in that the goal is to find relevant collections of documents rather than individual documents. We introduce resource selection techniques for blog site search and evaluate their performance. We also propose a "diversity factor" that measures the topic diversity of each blog site. Our results show that an appropriate combination of the resource selection techniques and the diversity factor achieves significant improvements in retrieval performance over the baselines. We also report results for these techniques on the TREC blog distillation task.
