Paper Information - Research on Multi-document Summarization Based on LDA Topic Model

Research on Multi-document Summarization Based on LDA Topic Model

Compared with the VSM (Vector Space Model) and graph-ranking models, the LDA (Latent Dirichlet Allocation) model can discover latent topics in a corpus, and these latent topics help sentence-ranking mechanisms produce a good summary. This paper proposes a new sentence-ranking method based on the LDA model. The method combines the topic distribution of each sentence with the topic importance of the corpus to calculate the posterior probability of each sentence, and then selects sentences by that posterior probability to form a summary. The topic distribution of a sentence represents the likelihood that the sentence belongs to each topic, and topic importance represents the degree to which a topic covers a significant portion of the corpus. The method highlights the latent topics and optimizes the summarization. Experimental results on the DUC2006 dataset show the advantage of the proposed multi-document summarization algorithm: ROUGE values improve over those of methods such as LexRank, LDA-SIBS, and LDA-PGS.
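The abstract does not give the scoring formula, so the following is only a minimal sketch of one plausible reading: each sentence is scored as the sum, over topics, of its topic probability weighted by that topic's importance, and the highest-scoring sentences are selected. The function name `rank_sentences`, the parameter `top_n`, and the toy numbers are all hypothetical and are not taken from the paper; in practice the two inputs would come from a trained LDA model.

```python
def rank_sentences(sentence_topic, topic_importance, top_n=2):
    """Rank sentences by sum_k P(topic k | sentence) * importance(topic k).

    Assumed scoring rule for illustration; the paper's exact
    posterior computation is not stated in the abstract.
    """
    # Normalize topic importance so the weights sum to 1.
    total = sum(topic_importance)
    importance = [w / total for w in topic_importance]

    scores = []
    for dist in sentence_topic:
        # Normalize each sentence's topic distribution before weighting.
        row_sum = sum(dist)
        scores.append(sum((p / row_sum) * w for p, w in zip(dist, importance)))

    # Indices of sentences sorted by descending score (ties keep input order).
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:top_n], scores


# Toy input: 3 sentences, 3 topics (made-up values).
sentence_topic = [
    [0.8, 0.1, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.1, 0.8],
]
topic_importance = [0.6, 0.3, 0.1]

selected, scores = rank_sentences(sentence_topic, topic_importance, top_n=2)
print(selected)
```

With these toy values, the first sentence scores highest because it concentrates on the most important topic, so it and the second sentence are selected for the summary.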

Qian Chen | Jinqiang Bian | Zengru Jiang

[1] Hans Peter Luhn, et al. The Automatic Creation of Literature Abstracts, 1958, IBM J. Res. Dev.

[2] Xin Liu, et al. Generic text summarization using relevance measure and latent semantic analysis, 2001, SIGIR '01.

[3] Satoshi Sekine, et al. A survey for Multi-Document Summarization, 2003, HLT-NAACL 2003.

[4] Chin-Yew Lin, et al. ROUGE: A Package for Automatic Evaluation of Summaries, 2004, ACL 2004.

[5] Balaraman Ravindran, et al. Latent dirichlet allocation based multi-document summarization, 2008, AND '08.

[6] Dragomir R. Radev, et al. LexRank: Graph-based Lexical Centrality as Salience in Text Summarization, 2004, J. Artif. Intell. Res.

[7] Jin Wang, et al. Summarization-based Query Expansion in Information Retrieval, 1998, COLING-ACL.