Topic-Driven Multi-document Summarization

This paper presents a topic-driven framework for generating a generic summary from a set of documents. Our approach rests on the intuition that, statistically, a summary’s probability distribution over topics should be consistent with the document set’s probability distribution over its inherent topics. Here, topics are defined as weighted bags of words and are derived by Latent Dirichlet Allocation from a collection of documents, either the given document set or a related large-scale corpus. This lets us represent text units of any granularity (word, sentence, summary, document, or document set) in a single vector space model via their probability distributions over the derived topics. We can therefore extract a sentence or summary by calculating the similarity between the sentence or summary and the given document set through their topic probability distributions. In particular, we propose two methods of similarity measurement: a static method and a dynamic method. While the former detects the salience of information in a static way, the latter further controls redundancy in a dynamic way. In addition, we integrate various popular features to improve performance. Evaluation on the TAC 2008 update summarization task shows encouraging results.
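
To make the distinction between static and dynamic similarity concrete, the following is a minimal sketch, not the system evaluated in this paper. It assumes the topics are already given as word-weight dictionaries (in practice they would come from Latent Dirichlet Allocation), it assumes cosine similarity as the similarity measure, and it assumes a simple sum-of-weights rule for mapping a text unit to a topic distribution. The function names `topic_distribution`, `static_rank`, and `dynamic_select` and the toy data are hypothetical, introduced only for illustration.

```python
import math
from collections import Counter


def topic_distribution(words, topics):
    """Map a bag of words to a normalized distribution over topics.

    `topics` is a list of {word: weight} dicts. Assumption for this sketch:
    a text unit's score for a topic is the count-weighted sum of the topic
    weights of its words; scores are then normalized to sum to 1.
    """
    counts = Counter(words)
    scores = [sum(t.get(w, 0.0) * c for w, c in counts.items()) for t in topics]
    total = sum(scores)
    if total == 0:
        # No topical word at all: fall back to a uniform distribution.
        return [1.0 / len(topics)] * len(topics)
    return [s / total for s in scores]


def cosine(p, q):
    """Cosine similarity between two topic vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0


def static_rank(sentences, topics):
    """Static method: score each sentence on its own against the document set."""
    doc = topic_distribution([w for s in sentences for w in s], topics)
    scored = [(cosine(topic_distribution(s, topics), doc), i)
              for i, s in enumerate(sentences)]
    # Highest similarity first; ties are broken by original sentence order.
    return [i for _, i in sorted(scored, key=lambda x: (-x[0], x[1]))]


def dynamic_select(sentences, topics, k):
    """Dynamic method: greedily add the sentence that brings the whole
    candidate summary closest to the document set's topic distribution."""
    doc = topic_distribution([w for s in sentences for w in s], topics)
    chosen, summary_words = [], []
    for _ in range(min(k, len(sentences))):
        remaining = [i for i in range(len(sentences)) if i not in chosen]
        best = max(
            remaining,
            key=lambda i: cosine(
                topic_distribution(summary_words + sentences[i], topics), doc
            ),
        )
        chosen.append(best)
        summary_words = summary_words + sentences[best]
    return chosen


# Toy data (hypothetical): two topics and three tokenized sentences.
topics = [{"storm": 0.6, "rain": 0.4}, {"vote": 0.7, "election": 0.3}]
sentences = [["storm", "rain"], ["storm", "storm"], ["vote", "election"]]

print(static_rank(sentences, topics)[:2])
print(dynamic_select(sentences, topics, 2))
```

On this toy input the static ranking places the two weather sentences first because each is individually close to the document set, whereas the dynamic selection pairs a weather sentence with the election sentence, since the combined summary then covers both topics. This is the sense in which scoring the summary as a whole discourages redundancy.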