A Multi-Document Multi-Lingual Automatic Summarization System

Abstract. In this paper, a new multidocument multi-lingual text summarization technique, based on singular value decomposition and hierarchical clustering, is proposed. The proposed approach relies on only two resources for any language: a word segmentation system and a dictionary of words along with their document frequencies. The summarizer initially takes a collection of related documents, and transforms them into a matrix; it then applies singular value decomposition to the resulted matrix. After using a binary hierarchical clustering algorithm, the most important sentences of the most important clusters form the summary. The appropriate place of each chosen sentence is determined by a novel technique. The system has been successfully tested on summarizIn this paper, a new multidocument multi-lingual text summarization technique, based on singular value decomposition and hierarchical clustering, is proposed. The proposed approach relies on only two resources for any language: a word segmentation system and a dictionary of words along with their document frequencies. The summarizer initially takes a collection of related documents, and transforms them into a matrix; it then applies singular value decomposition to the resulted matrix. After using a binary hierarchical clustering algorithm, the most important sentences of the most important clusters form the summary. The appropriate place of each chosen sentence is determined by a novel technique. The system has been successfully tested on summarizing several Persian document collections.

[1]  Regina Barzilay,et al.  Using Lexical Chains for Text Summarization , 1997 .

[2]  Xin Liu,et al.  Generic text summarization using relevance measure and latent semantic analysis , 2001, SIGIR '01.

[3]  David Evans,et al.  Tracking and summarizing news on a daily basis with Columbia's Newsblaster , 2002 .

[4]  H. P. Edmundson,et al.  Automatic abstracting and indexing—survey and recommendations , 1961, CACM.

[5]  Kathleen McKeown,et al.  Cut and Paste Based Text Summarization , 2000, ANLP.

[6]  Andrew Hickl,et al.  LCC's GISTexter at DUC 2006: Multi-Strategy Multi-Document Summarization , 2006 .

[7]  Mark T. Maybury,et al.  Advances in Automatic Text Summarization , 1999 .

[8]  Daniel Boley,et al.  Principal Direction Divisive Partitioning , 1998, Data Mining and Knowledge Discovery.

[9]  Daniel Marcu,et al.  Statistics-Based Summarization - Step One: Sentence Compression , 2000, AAAI/IAAI.

[10]  Chin-Yew Lin,et al.  From Single to Multi-document Summarization : A Prototype System and its Evaluation , 2002 .

[11]  Eduard H. Hovy,et al.  From Single to Multi-document Summarization , 2002, ACL.

[12]  Vasileios Hatzivassiloglou,et al.  Event-Based Extractive Summarization , 2004 .

[13]  Chris D. Paice,et al.  The identification of important concepts in highly structured technical papers , 1993, SIGIR.

[14]  Zhu Zhang,et al.  NewsInEssence: A System For Domain-Independent, Real-Time News Clustering and Multi-Document Summarization , 2001, HLT.

[15]  Karen Spärck Jones Automatic summarising: factors and directions , 1998, ArXiv.

[16]  Min-Yen Kan,et al.  Customization in a unified framework for summarizing medical literature , 2005, Artif. Intell. Medicine.

[17]  Gerard Salton,et al.  Term-Weighting Approaches in Automatic Text Retrieval , 1988, Inf. Process. Manag..

[18]  G Salton,et al.  Automatic Analysis, Theme Generation, and Summarization of Machine-Readable Texts , 1994, Science.

[19]  Michael I. Jordan,et al.  Latent Dirichlet Allocation , 2001, J. Mach. Learn. Res..

[20]  Karel Jezek,et al.  Text Summarization and Singular Value Decomposition , 2004, ADVIS.