An Efficient Approach for Multi-Sentence Compression

Multi-Sentence Compression (MSC) is of great value to many real-world applications, such as guided microblog summarization, opinion summarization and newswire summarization. Recently, word graph-based approaches have been proposed and have become popular in MSC. Their key assumption is that redundancy among a set of related sentences provides a reliable way to generate informative and grammatical sentences. In this paper, we propose an effective approach that enhances word graph-based MSC and tackles an issue confronting most state-of-the-art MSC approaches: improving informativity and grammaticality at the same time. Our approach consists of three main components: (1) a merging method based on Multiword Expressions (MWE); (2) a mapping strategy based on synonymy between words; (3) a re-ranking step that uses a POS-based language model (POS-LM) to identify the best of the generated compression candidates. We demonstrate the effectiveness of this novel approach on a dataset of clusters of English newswire sentences. The observed improvements in informativity and grammaticality of the generated compressions correspond to an error reduction of up to 44% over state-of-the-art MSC systems.
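
To make the word graph idea concrete, the following is a minimal sketch of the basic word graph technique that the approach builds on, not of the proposed method itself. It assumes whitespace tokenization, merges identical lowercased words into one node, and weights each edge by the inverse of its frequency so that redundant transitions are cheap. The MWE merging, synonym mapping and POS-LM re-ranking components are not shown, and the names `build_word_graph` and `compress` are hypothetical.

```python
import heapq
from collections import defaultdict

START, END = "<s>", "</s>"


def build_word_graph(sentences):
    """Build a directed word graph from related sentences.

    Assumption: identical lowercased tokens share one node, and an edge
    that occurs c times gets weight 1/c (a simplified weighting).
    """
    counts = defaultdict(int)
    for sentence in sentences:
        tokens = [START] + sentence.lower().split() + [END]
        for a, b in zip(tokens, tokens[1:]):
            counts[(a, b)] += 1
    graph = defaultdict(dict)
    for (a, b), c in counts.items():
        graph[a][b] = 1.0 / c
    return graph


def compress(sentences):
    """Return the words on the cheapest START-to-END path (Dijkstra)."""
    graph = build_word_graph(sentences)
    queue = [(0.0, [START])]
    visited = set()
    while queue:
        cost, path = heapq.heappop(queue)
        node = path[-1]
        if node == END:
            return " ".join(path[1:-1])
        if node in visited:
            continue
        visited.add(node)
        for nxt, weight in graph[node].items():
            if nxt not in visited:
                heapq.heappush(queue, (cost + weight, path + [nxt]))
    return ""


cluster = [
    "obama visited paris on monday",
    "president obama visited paris",
    "obama visited paris for talks on monday",
]
print(compress(cluster))  # prints "obama visited paris"
```

A plain shortest path like this favours short, highly redundant outputs and can drop informative words, which is the kind of informativity and grammaticality trade-off that merging, mapping and re-ranking steps aim to address.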
