A Multilingual Study of Multi-Sentence Compression using Word Vertex-Labeled Graphs and Integer Linear Programming

Multi-Sentence Compression (MSC) aims to generate a short sentence containing the key information from a cluster of similar sentences. MSC enables summarization and question-answering systems to generate outputs that combine fully formed sentences from one or several documents. This paper describes an Integer Linear Programming method for MSC that uses a vertex-labeled graph to select different keywords, with the goal of generating more informative sentences while maintaining their grammaticality. Our system produces good-quality compressions and outperforms the state of the art in evaluations conducted on news datasets in three languages: French, Portuguese and Spanish. We conducted both automatic and manual evaluations to assess the informativeness and grammaticality of the compressions for each dataset. In additional tests, which take advantage of the fact that the length of compressions can be modulated, we still improve ROUGE scores with shorter output sentences.
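To make the word-graph idea concrete, the following is a minimal sketch of the simpler shortest-path view of MSC over a word graph. It is not the Integer Linear Programming formulation with vertex labels described above. The whitespace tokenization, the merging of identical lowercase tokens into one vertex, the edge cost of 1/count, and the names `build_word_graph` and `shortest_compression` are all illustrative assumptions.

```python
import heapq
from collections import defaultdict

# Hypothetical sentinel vertices marking sentence boundaries.
START, END = "<s>", "</s>"


def build_word_graph(sentences):
    """Count how often each token directly follows another across the cluster.

    Assumption: identical lowercase tokens share one vertex; real systems
    usually also use part-of-speech tags and context to decide merges.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for sentence in sentences:
        tokens = [START] + sentence.lower().split() + [END]
        for left, right in zip(tokens, tokens[1:]):
            counts[left][right] += 1
    return counts


def shortest_compression(counts):
    """Dijkstra from START to END with edge cost 1/count (assumed weighting).

    Transitions seen in many sentences are cheap, so the cheapest path
    tends to follow wording shared across the cluster.
    """
    heap = [(0.0, [START])]
    settled = set()
    while heap:
        cost, path = heapq.heappop(heap)
        node = path[-1]
        if node == END:
            return path[1:-1]  # drop the sentinel vertices
        if node in settled:
            continue
        settled.add(node)
        for nxt, n in counts[node].items():
            if nxt not in settled:
                heapq.heappush(heap, (cost + 1.0 / n, path + [nxt]))
    return []


if __name__ == "__main__":
    cluster = ["a b c", "a b d c", "a b c"]
    print(shortest_compression(build_word_graph(cluster)))
```

On this toy cluster the cheapest path is `a b c`, since the transition `b -> c` occurs twice while the detour through `d` occurs once. A pure shortest-path objective has no explicit way to reward covering particular keywords, which is the kind of requirement an ILP formulation can state as constraints.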
