Minimum redundancy and maximum relevance for single and multi-document Arabic text summarization

Automatic text summarization aims to produce summaries of one or more texts by computational means. In this paper, we propose a novel statistical summarization system for Arabic texts. First, our system uses a clustering algorithm and an adapted discriminant analysis method, mRMR (minimum redundancy and maximum relevance), to score terms. Through mRMR analysis, terms are ranked according to their discriminant and coverage power. Second, we propose a novel sentence extraction algorithm that selects sentences containing top-ranked terms while maximizing diversity. Our system uses minimal language-dependent processing: sentence splitting, tokenization, and root extraction. Experimental results on the EASC and TAC 2011 MultiLingual datasets show that our approach is competitive with state-of-the-art systems.
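
To make the two steps concrete, the following is a minimal sketch of term ranking with the standard greedy mRMR criterion (relevance minus mean redundancy, both measured by mutual information), followed by a simple coverage-based sentence picker. It is not the adapted method or the extraction algorithm of this paper, whose details the abstract does not give. The names `mutual_information`, `mrmr_rank`, and `extract_sentences` are hypothetical, sentences are assumed to be already split, tokenized, and reduced to roots, and the cluster labels are assumed to come from a separate clustering step.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in nats) between two equal-length discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    # Sum over observed pairs of p(x, y) * log(p(x, y) / (p(x) * p(y))).
    return sum((c / n) * math.log(c * n / (px[x] * py[y]))
               for (x, y), c in joint.items())


def mrmr_rank(sentences, labels, k):
    """Greedy mRMR, difference form (an assumption, not the paper's adaptation).

    sentences: list of token lists; labels: one cluster id per sentence.
    Returns up to k terms in selection order.
    """
    vocab = sorted({t for s in sentences for t in s})
    # Binary presence vector of each term over the sentences.
    presence = {t: [int(t in s) for s in sentences] for t in vocab}
    relevance = {t: mutual_information(presence[t], labels) for t in vocab}
    selected, candidates = [], list(vocab)

    def score(term):
        if not selected:
            return relevance[term]
        redundancy = sum(mutual_information(presence[term], presence[s])
                         for s in selected) / len(selected)
        return relevance[term] - redundancy

    while candidates and len(selected) < k:
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


def extract_sentences(sentences, ranked_terms, max_sentences):
    """Pick sentences that cover the most not-yet-covered top terms.

    Covering new terms at each step is used here as a simple proxy for diversity.
    """
    remaining, chosen = set(ranked_terms), []
    while remaining and len(chosen) < max_sentences:
        pool = [i for i in range(len(sentences)) if i not in chosen]
        if not pool:
            break
        best = max(pool, key=lambda i: len(remaining & set(sentences[i])))
        if not remaining & set(sentences[best]):
            break  # no remaining sentence adds a new top term
        chosen.append(best)
        remaining -= set(sentences[best])
    return sorted(chosen)  # restore document order


# Toy data: placeholder tokens stand in for Arabic roots.
toy_sentences = [["a", "a2"], ["a", "a2"], ["b"], ["b"], ["c"], ["c"]]
toy_labels = [0, 0, 1, 1, 2, 2]
top_terms = mrmr_rank(toy_sentences, toy_labels, k=3)
summary_ids = extract_sentences(toy_sentences, top_terms, max_sentences=2)
```

In the toy data, `a2` always co-occurs with `a`, so its redundancy penalty pushes it below `b` and `c` even though all four terms are equally relevant to the clusters. This is the behaviour the mRMR criterion is designed to produce.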
