Using Statistical and Semantic Analysis for Arabic Text Summarization

Automatic text summarization is an essential tool for coping with information overload. The field has received little attention for the Arabic language, and only a few related works are currently available. Arabic text summarization faces two main issues: how to extract semantic relationships between textual units and how to deal with redundancy. To address these problems, we propose a hybrid method for generating extractive summaries of Arabic documents. Our approach is based on a two-dimensional, undirected, weighted graph in which sentences are nodes and each pair of sentences is connected by two edges, one representing a statistical similarity measure and the other a semantic one. The statistical similarity measure builds on the content overlap between two sentences, while the semantic one is based on semantic information extracted from the Arabic WordNet (AWN) ontology. The score of each sentence is then computed by running the PageRank ranking algorithm on the generated graph. This score is subsequently refined by adding other statistical features of the text, such as TF.ISF and sentence position. The final summary is built by selecting the top-ranking sentences, and redundancy and information diversity are handled with an adapted maximal marginal relevance (MMR) method. Experimental results on the EASC dataset show that our proposed approach outperforms some existing Arabic summarization systems.
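
The pipeline described above can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several parts are assumptions made for illustration only: sentences are tokenized on whitespace, the statistical edge uses Jaccard overlap, the semantic edge is a pluggable function (`semantic_similarity`, hypothetical) that falls back to the statistical value when no AWN-based measure is supplied, the two edge weights are merged with a mixing parameter `alpha`, and the TF.ISF and sentence-position features are left out. All function names are hypothetical.

```python
def jaccard(a, b):
    """Content overlap between two token lists (assumed statistical measure)."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def pagerank(weights, damping=0.85, iterations=50):
    """Power-iteration PageRank on a symmetric weighted adjacency matrix."""
    n = len(weights)
    scores = [1.0 / n] * n
    out_sum = [sum(row) for row in weights]
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            # Each neighbour j passes on a share of its score proportional
            # to the weight of the edge (j, i).
            rank = sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n)
                if out_sum[j] > 0
            )
            new_scores.append((1 - damping) / n + damping * rank)
        scores = new_scores
    return scores


def mmr_select(scores, sim, k, lam=0.7):
    """Greedy MMR: trade relevance against similarity to already chosen items."""
    selected = []
    candidates = set(range(len(scores)))
    while candidates and len(selected) < k:
        def mmr(i):
            redundancy = max((sim[i][j] for j in selected), default=0.0)
            return lam * scores[i] - (1 - lam) * redundancy
        best = max(sorted(candidates), key=mmr)
        selected.append(best)
        candidates.remove(best)
    return selected


def summarize(sentences, k=2, semantic_similarity=None, alpha=0.5):
    """Rank sentences on a similarity graph, then pick k of them with MMR."""
    tokens = [s.split() for s in sentences]  # assumption: whitespace tokens
    n = len(tokens)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            stat = jaccard(tokens[i], tokens[j])
            # Assumption: without an AWN-based measure, reuse the overlap value.
            sem = semantic_similarity(tokens[i], tokens[j]) if semantic_similarity else stat
            sim[i][j] = sim[j][i] = alpha * stat + (1 - alpha) * sem
    scores = pagerank(sim)
    top = max(scores)
    scores = [s / top for s in scores]  # bring scores onto the same scale as sim
    chosen = mmr_select(scores, sim, k)
    return [sentences[i] for i in sorted(chosen)]  # keep document order
```

Calling `summarize(sentences, k=3)` on a list of sentence strings returns three of them in their original order. Merging the two edges into one weight is a simplification: the abstract describes two separate edges per sentence pair, and how they are combined during ranking is not specified there.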
