Multilingual Statistical News Summarization

In this chapter we present a generic approach for summarizing clusters of multilingual news articles such as the ones produced by the Europe Media Monitor (EMM) system. Our approach uses robust statistical techniques together with multilingual tools for named entity recognition and disambiguation to produce entity-centered summaries. We ran experiments on the TAC 2008 and 2009 data sets (English corpora for summarization research) and obtained very promising results: at TAC 2009 our runs ranked first for linguistic quality and second for overall responsiveness. We also ran a small-scale evaluation on languages other than English, which demonstrates the multilinguality of our approach and also provides interesting evidence against the pervasive assumption “if it works for English, it works for any language”. Finally, we present an online system, currently under development, which will eventually incorporate all the elements of the summarization approach discussed here, and we show sample output summaries in various languages.
