Summarizing web forum threads based on a latent topic propagation process

With an increasingly amount of information in web forums, quick comprehension of threads in web forums has become a challenging research problem. To handle this issue, this paper investigates the task of Web Forum Thread Summarization (WFTS), aiming to give a brief statement of each thread that involving multiple dynamic topics. When applied to the task of WFTS, traditional summarization methods are cramped by topic dependencies, topic drifting and text sparseness. Consequently, we explore an unsupervised topic propagation model in this paper, the Post Propagation Model (PPM), to burst through these problems by simultaneously modeling the semantics and the reply relationship existing in each thread. Each post in PPM is considered as a mixture of topics, and a product of Dirichlet distributions in previous posts is employed to model each topic dependencies during the asynchronous discussion. Based on this model, the task of WFTS is accomplished by extracting most significant sentences in a thread. The experimental results on two different forum data sets show that WFTS based on the PPM outperforms several state-of-the-art summarization methods in terms of ROUGE metrics.

[1]  Anil K. Jain,et al.  Data clustering: a review , 1999, CSUR.

[2]  Chong Wang,et al.  Continuous Time Dynamic Topic Models , 2008, UAI.

[3]  Dragomir R. Radev,et al.  Centroid-based summarization of multiple documents , 2004, Inf. Process. Manag..

[4]  S. Y. Dennis On the hyper-Dirichlet type 1 and hyper-Liouville distributions , 1991 .

[5]  Hongyuan Zha,et al.  Generic summarization and keyphrase extraction using mutual reinforcement principle and sentence clustering , 2002, SIGIR '02.

[6]  Chen Lin,et al.  Simultaneously modeling semantics and structure of threaded discussions: a sparse coding approach and its applications , 2009, SIGIR.

[7]  Gang Wang,et al.  Dynamically Modeling Semantic Dependencies in Web Forum Threads , 2011, 2011 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology.

[8]  Yan Liu,et al.  Topic-link LDA: joint models of topic and author community , 2009, ICML '09.

[9]  Taher H. Haveliwala Topic-Sensitive PageRank: A Context-Sensitive Ranking Algorithm for Web Search , 2003, IEEE Trans. Knowl. Data Eng..

[10]  Hanna M. Wallach,et al.  Topic modeling: beyond bag-of-words , 2006, ICML.

[11]  Naonori Ueda,et al.  Topic Tracking Model for Analyzing Consumer Purchase Behavior , 2009, IJCAI.

[12]  Furu Wei,et al.  Query-sensitive mutual reinforcement chain and its application in query-oriented multi-document summarization , 2008, SIGIR '08.

[13]  John D. Lafferty,et al.  Dynamic topic models , 2006, ICML.

[14]  Nan Sun,et al.  Exploiting internal and external semantics for the clustering of short texts using world knowledge , 2009, CIKM.

[15]  Liang Zhou,et al.  Digesting Virtual "Geek" Culture: The Summarization of Technical Internet Relay Chats , 2005, ACL.

[16]  Andrew McCallum,et al.  Topics over time: a non-Markov continuous-time model of topical trends , 2006, KDD '06.

[17]  Lucy Vanderwende,et al.  Exploring Content Models for Multi-Document Summarization , 2009, NAACL.