Discovery of Topically Coherent Sentences for Extractive Summarization

Extractive methods for multi-document summarization are mainly governed by information overlap, coherence, and content constraints. We present an unsupervised probabilistic approach to model the hidden abstract concepts across documents as well as the correlation between these concepts, to generate topically coherent and non-redundant summaries. Based on human evaluations our models generate summaries with higher linguistic quality in terms of coherence, readability, and redundancy compared to benchmark systems. Although our system is unsupervised and optimized for topical coherence, we achieve a 44.1 ROUGE on the DUC-07 test set, roughly in the range of state-of-the-art supervised models.

[1]  Xiaojun Wan,et al.  Improved Affinity Graph Based Multi-Document Summarization , 2006, NAACL.

[2]  Michael Gamon,et al.  The PYTHY Summarization System: Microsoft Research at DUC 2007 , 2007 .

[3]  Sanda M. Harabagiu,et al.  Satisfying information needs with multi-document summaries , 2007, Inf. Process. Manag..

[4]  Thomas L. Griffiths,et al.  The Author-Topic Model for Authors and Documents , 2004, UAI.

[5]  Dragomir R. Radev,et al.  LexRank: Graph-based Centrality as Salience in Text Summarization , 2004 .

[6]  Juan-Zi Li,et al.  Query-Focused Summarization by Combining Topic Model and Affinity Propagation , 2009, APWeb/WAIM.

[7]  Hanna M. Wallach,et al.  Topic modeling: beyond bag-of-words , 2006, ICML.

[8]  Regina Barzilay,et al.  Information Fusion in the Context of Multi-Document Summarization , 1999, ACL.

[9]  Wei Li,et al.  Nonparametric Bayes Pachinko Allocation , 2007, UAI.

[10]  Michael I. Jordan,et al.  Latent Dirichlet Allocation , 2001, J. Mach. Learn. Res..

[11]  Dianne P. O'Leary,et al.  Topic-Focused Multi-Document Summarization Using an Approximate Oracle Score , 2006, ACL.

[12]  Eduard H. Hovy,et al.  The Automated Acquisition of Topic Signatures for Text Summarization , 2000, COLING.

[13]  Yihong Gong,et al.  Multi-Document Summarization using Sentence-based Topic Models , 2009, ACL.

[14]  Daniel Marcu,et al.  Bayesian Query-Focused Summarization , 2006, ACL.

[15]  Dilek Z. Hakkani-Tür,et al.  A Hybrid Hierarchical Model for Multi-Document Summarization , 2010, ACL.

[16]  Wei Li,et al.  Pachinko allocation: DAG-structured mixture models of topic correlations , 2006, ICML.

[17]  Ani Nenkova,et al.  A compositional context sensitive multi-document summarizer: exploring the factors that influence summarization , 2006, SIGIR.

[18]  Thomas L. Griffiths,et al.  Hierarchical Topic Models and the Nested Chinese Restaurant Process , 2003, NIPS.

[19]  Regina Barzilay,et al.  Catching the Drift: Probabilistic Content Models, with Applications to Generation and Summarization , 2004, NAACL.

[20]  Regina Barzilay,et al.  Bayesian Unsupervised Topic Segmentation , 2008, EMNLP.

[21]  Jie Tang,et al.  Multi-topic Based Query-Oriented Summarization , 2009, SDM.

[22]  George A. Miller,et al.  WordNet: A Lexical Database for English , 1995, HLT.

[23]  Wei Li,et al.  Mixtures of hierarchical topics with Pachinko allocation , 2007, ICML '07.

[24]  Ani Nenkova,et al.  The Impact of Frequency on Summarization , 2005 .

[25]  Dragomir R. Radev,et al.  LexRank: Graph-based Lexical Centrality as Salience in Text Summarization , 2004, J. Artif. Intell. Res..

[26]  Hua Li,et al.  Document Summarization Using Conditional Random Fields , 2007, IJCAI.

[27]  Lucy Vanderwende,et al.  Exploring Content Models for Multi-Document Summarization , 2009, NAACL.