Two-Phase Multidocument Summarization Through Content-Attention-Based Subtopic Detection

The multidocument summarization problem deals with extracting the main information and ideas from a set of related documents. One solution to this problem is an extraction strategy that finds a small subset of sentences covering the most important information in the whole document set. Although a large number of machine-learning-based methods have shown great promise, the lack of high-quality training data poses an inherent obstacle to them. Furthermore, because of the proliferation of low-quality documents on the Internet, the existing summarization strategies, which are based merely on statistical features, perform poorly. In this article, we propose a new two-phase multidocument summarization strategy using content-attention-based subtopic detection. First, inspired by the distance-dynamics-based community detection mechanism, we extract subtopics from the set of documents by examining both their content attention and their underlying semantic relations. Instead of complicated neural attention mechanisms, we propose a simple iteration-based content attention method to complete the subtopic detection task. Second, we formulate summarization from different subtopics as a combinatorial optimization problem of minimizing sentence distance and maximizing topic diversity. We prove the submodularity of this optimization problem, which allows us to propose a new multidocument summarization algorithm based on a greedy mechanism. Finally, we experimentally validate our new algorithms on the BBC news summary and wikiHow data sets. The results show that our new algorithms outperform the state-of-the-art methods.
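
To make the second phase concrete, the following is a minimal sketch of greedy sentence selection under a submodular objective. It is not the objective defined in this article: the function name `greedy_summary`, the facility-location coverage term (standing in for "minimizing sentence distance"), the square-root subtopic reward (standing in for "maximizing topic diversity"), and the `diversity_weight` parameter are all assumptions chosen for illustration. Subtopic labels are assumed to come from the first phase.

```python
import math


def greedy_summary(similarity, subtopic_of, budget, diversity_weight=1.0):
    """Greedily pick up to `budget` sentence indices (illustrative sketch).

    similarity  : square list of lists, similarity[i][j] in [0, 1]
    subtopic_of : list mapping sentence index -> subtopic id (assumed given)
    """
    n = len(similarity)

    def objective(selected):
        # Hypothetical objective, not the one proved submodular in the article.
        if not selected:
            return 0.0
        # Coverage: every sentence is credited with its closest selected
        # sentence, so a high value means small sentence distance overall.
        coverage = sum(max(similarity[i][j] for j in selected) for i in range(n))
        # Diversity: concave (square-root) reward per subtopic, so repeated
        # picks from the same subtopic yield diminishing returns.
        counts = {}
        for j in selected:
            counts[subtopic_of[j]] = counts.get(subtopic_of[j], 0) + 1
        diversity = sum(math.sqrt(c) for c in counts.values())
        return coverage + diversity_weight * diversity

    selected = []
    current = 0.0
    while len(selected) < budget:
        best_gain, best_idx = 0.0, None
        for cand in range(n):
            if cand in selected:
                continue
            # Marginal gain of adding this candidate to the current summary.
            gain = objective(selected + [cand]) - current
            if gain > best_gain:
                best_gain, best_idx = gain, cand
        if best_idx is None:
            break  # no remaining sentence improves the objective
        selected.append(best_idx)
        current += best_gain
    return selected
```

Both terms in this sketch are monotone and submodular, which is the property that lets a simple greedy loop achieve the well-known (1 - 1/e) approximation guarantee under a cardinality budget. The sketch recomputes the objective from scratch for each candidate for readability; a practical implementation would update marginal gains incrementally.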