A Graph-based Approach to Cross-language Multi-document Summarization

Abstract—Cross-language summarization is the task of generating a summary in a language different from the language of the source documents. In this paper, we propose a graph-based approach to multi-document summarization that integrates machine translation quality scores in the sentence extraction process. We evaluate our method on a manually translated subset of the DUC 2004 evaluation campaign. Results indicate that our approach improves the readability of the generated summaries without degrading their informativity.

Index Terms—Graph-based approach, cross-language multi-document summarization.

I. INTRODUCTION

The rapid growth and online availability of information in numerous languages have made cross-language information retrieval and extraction tasks a highly relevant field of research. Cross-language document summarization aims at providing quick access to information expressed in one or more languages. More precisely, this task consists in producing a summary in a language different from the language of the source documents. In this study, we focus on English-to-French multi-document summarization. The primary motivation is to allow French readers to access the ever-increasing amount of news available through English news sources.

Recent years have seen increased interest in applying graph-theoretic models to Natural Language Processing (NLP) [1]. Graphs are a natural way to encode information for NLP: entities can be represented as nodes and the relations between them as edges. Graph-based representations of linguistic units as diverse as words, sentences and documents give rise to efficient solutions in a variety of tasks ranging from part-of-speech tagging to information extraction and sentiment analysis. Here, we apply a graph-based ranking algorithm to multi-document summarization.

A straightforward idea for cross-language summarization is to translate the summary from one language to the other.
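To make the idea of graph-based sentence ranking with translation quality scores concrete, the following Python sketch builds a sentence graph and ranks its nodes with a PageRank-style iteration. It is only an illustration under stated assumptions and not necessarily the formulation used in this paper: the Jaccard word-overlap similarity, the choice to scale each edge by the quality score of its target sentence, the damping factor of 0.85, and the names `rank_sentences` and `quality` are all hypothetical.

```python
import re


def tokenize(sentence):
    """Lowercase a sentence and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", sentence.lower()))


def jaccard(a, b):
    """Word-overlap similarity between two token sets (an assumed measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_sentences(sentences, quality, damping=0.85, iterations=100, tol=1e-8):
    """Score sentences with a PageRank-style iteration on a weighted graph.

    `quality[j]` is a hypothetical translation quality score in [0, 1] for
    sentence j. Assumption: the edge i -> j is weighted by the similarity of
    i and j times quality[j], so well-translated sentences receive more weight.
    """
    n = len(sentences)
    if n == 0:
        return []
    tokens = [tokenize(s) for s in sentences]

    # weight[i][j] is the weight of the directed edge from node i to node j.
    weight = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                weight[i][j] = jaccard(tokens[i], tokens[j]) * quality[j]
    out_sum = [sum(row) for row in weight]

    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for j in range(n):
            # Nodes without outgoing weight contribute nothing in this sketch.
            incoming = sum(
                scores[i] * weight[i][j] / out_sum[i]
                for i in range(n)
                if out_sum[i] > 0
            )
            new_scores.append((1.0 - damping) / n + damping * incoming)
        delta = sum(abs(a - b) for a, b in zip(new_scores, scores))
        scores = new_scores
        if delta < tol:
            break
    return scores


if __name__ == "__main__":
    docs = [
        "The storm hit the coast",
        "The storm hit the city",
        "The storm hit the town",
    ]
    # Hypothetical quality scores; in practice they would come from a
    # translation quality estimator.
    for sentence, score in zip(docs, rank_sentences(docs, [0.9, 0.2, 0.5])):
        print(f"{score:.3f}  {sentence}")
```

In this toy example the three sentences are equally similar to one another, so any difference in their final scores comes only from the assumed quality weights. An extractive summarizer would then select the top-ranked sentences, typically with an additional redundancy check.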

[1] Constantin Orasan, et al. Evaluation of a Cross-lingual Romanian-English Multi-document Summariser, 2008, LREC.

[2] Tibor Kiss, et al. Unsupervised Multilingual Sentence Boundary Detection, 2006, CL.

[3] George R. Doddington, et al. Automatic Evaluation of Machine Translation Quality Using N-gram Co-Occurrence Statistics, 2002.

[4] Rajeev Motwani, et al. The PageRank Citation Ranking: Bringing Order to the Web, 1999, WWW 1999.

[5] Philipp Koehn, et al. Findings of the 2010 Joint Workshop on Statistical Machine Translation and Metrics for Machine Translation, 2010, WMT@ACL.

[6] Rada Mihalcea, et al. Graph-based Ranking Algorithms for Sentence Extraction, Applied to Text Summarization, 2004, ACL.

[7] Nello Cristianini, et al. Estimating the Sentence-Level Quality of Machine Translation Systems, 2009, EAMT.

[8] Chih-Jen Lin, et al. LIBSVM: A library for support vector machines, 2011, TIST.

[9] Chin-Yew Lin, et al. ROUGE: A Package for Automatic Evaluation of Summaries, 2004, ACL 2004.

[10] Steven Bird, et al. NLTK: The Natural Language Toolkit, 2002, ACL.

[11] Chris Quirk, et al. Training a Sentence-Level Machine Translation Confidence Measure, 2004, LREC.

[12] Dragomir R. Radev, et al. Centroid-based summarization of multiple documents, 2004, Inf. Process. Manag.

[13] Eric Wehrli, et al. A Symbolic Summarizer with 2 Steps of Sentence Selection for TAC 2009, 2009, TAC.

[14] Jade Goldstein-Stewart, et al. The use of MMR, diversity-based reranking for reordering documents and producing summaries, 1998, SIGIR '98.

[15] Regina Barzilay, et al. Information Fusion in the Context of Multi-Document Summarization, 1999, ACL.

[16] Kamel Smaïli, et al. Efficient combination of confidence measures for machine translation, 2009, INTERSPEECH.

[17] Kam-Fai Wong, et al. Extractive Summarization Using Supervised and Semi-Supervised Learning, 2008, COLING.

[18] Alex Kulesza, et al. Confidence Estimation for Machine Translation, 2004, COLING.

[19] Rada Mihalcea, et al. A Language Independent Algorithm for Single and Multiple Document Summarization, 2005, IJCNLP.

[20] Dragomir R. Radev, et al. LexRank: Graph-based Lexical Centrality as Salience in Text Summarization, 2004, J. Artif. Intell. Res.

[21] Xiaojun Wan, et al. Cross-Language Document Summarization Based on Machine Translation Quality Prediction, 2010, ACL.