Exploiting Thread Structures to Improve Smoothing of Language Models for Forum Post Retrieval

Forum data have many unique characteristics that make forum post retrieval different from traditional document retrieval and web search, raising interesting research questions about how to optimize retrieval accuracy. In this paper, we study how to exploit the naturally available raw thread structures of forums to improve retrieval accuracy in the language modeling framework. Specifically, we propose and study two schemes for smoothing the language model of a forum post based on the thread that contains it, and we explore several variants of each scheme that exploit thread structure in different ways. We also create a human-annotated test data set for forum post retrieval and use it to evaluate the proposed smoothing methods. The experimental results show that the proposed methods for leveraging forum threads to improve the estimation of document language models are effective, and that they outperform existing smoothing methods on the forum post retrieval task.
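
To make the idea of thread-based smoothing concrete, the sketch below interpolates a post's maximum-likelihood word distribution with a distribution estimated from its whole thread and with a collection background model. This is a generic Jelinek-Mercer style illustration and is not claimed to be either of the two schemes proposed in the paper; the function names, the toy token lists, and the interpolation weights `lam_thread` and `lam_coll` are all assumptions made for illustration.

```python
import math
from collections import Counter


def mle(tokens):
    """Maximum-likelihood unigram distribution over a list of tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def smoothed_post_model(post, thread, collection, lam_thread=0.3, lam_coll=0.2):
    """Return p(w | post) smoothed with thread and collection models.

    Hypothetical weights: lam_thread and lam_coll are illustrative values,
    not parameters reported in the paper.
    """
    p_post, p_thread, p_coll = mle(post), mle(thread), mle(collection)
    lam_post = 1.0 - lam_thread - lam_coll

    def prob(w):
        return (lam_post * p_post.get(w, 0.0)
                + lam_thread * p_thread.get(w, 0.0)
                + lam_coll * p_coll.get(w, 0.0))

    return prob


def query_log_likelihood(query, prob):
    """Score a query by log-likelihood, skipping words unseen in the collection."""
    return sum(math.log(prob(w)) for w in query if prob(w) > 0.0)


# Toy data (assumed): the thread contains the post plus other posts.
post = ["router", "reboot"]
thread = ["wifi", "drops", "router", "reboot", "firmware"]
collection = thread + ["printer", "driver", "install"]

prob = smoothed_post_model(post, thread, collection)
score = query_log_likelihood(["router", "firmware"], prob)
```

In this sketch, a word such as "firmware" that never occurs in the post itself still receives non-zero probability because it occurs elsewhere in the thread, which is the intuition behind using the thread to smooth the post model.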
