Research on Multi-document Summarization Based on LDA Topic Model

Compared with the VSM (Vector Space Model) and graph-ranking models, the LDA (Latent Dirichlet Allocation) model can discover latent topics in a corpus, and these latent topics help sentence-ranking mechanisms form a good summary. This paper proposes a new sentence-ranking method based on the LDA model. The method combines the topic distribution of each sentence with the topic importance of the corpus to calculate the posterior probability of the sentence, and then selects sentences according to that posterior probability to form a summary. The topic distribution of a sentence represents the likelihood that the sentence belongs to each topic, and topic importance represents the degree to which a topic covers a significant portion of the corpus. The method highlights the latent topics and optimizes the summarization. Experimental results on the DUC2006 dataset show the advantage of the proposed multi-document summarization algorithm: its ROUGE values are higher than those of methods such as LexRank, LDA-SIBS, and LDA-PGS.
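
The ranking idea described above can be illustrated with a minimal sketch. The abstract does not state the exact formula, so the sketch assumes that a sentence's score is the sum over topics of its topic probability weighted by the topic's importance, and that the highest-scoring sentences are selected. The function names, the toy distributions, and the importance weights are hypothetical and are not taken from the paper; in practice the distributions would come from a trained LDA model.

```python
from typing import List, Sequence


def sentence_score(topic_dist: Sequence[float],
                   topic_importance: Sequence[float]) -> float:
    """Assumed scoring rule: sum_k P(topic k | sentence) * importance(topic k)."""
    if len(topic_dist) != len(topic_importance):
        raise ValueError("topic_dist and topic_importance must have equal length")
    return sum(p * w for p, w in zip(topic_dist, topic_importance))


def rank_sentences(sentence_topic_dists: Sequence[Sequence[float]],
                   topic_importance: Sequence[float],
                   top_n: int) -> List[int]:
    """Return the indices of the top_n sentences, highest score first."""
    scores = [sentence_score(d, topic_importance) for d in sentence_topic_dists]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:top_n]


# Toy data (hypothetical): three topics, three sentences.
topic_importance = [0.6, 0.3, 0.1]
sentence_topic_dists = [
    [0.1, 0.1, 0.8],  # sentence 0: mostly about the least important topic
    [0.7, 0.2, 0.1],  # sentence 1: mostly about the most important topic
    [0.3, 0.6, 0.1],  # sentence 2: mostly about the second topic
]

selected = rank_sentences(sentence_topic_dists, topic_importance, top_n=2)
print(selected)
```

Under this assumed rule, a sentence concentrated on an important topic outranks one concentrated on a minor topic, which is the sense in which the ranking highlights the latent topics. A real summarizer would also need a length limit and some handling of redundancy, which this sketch omits.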