Automatic Arabic text summarization using clustering and keyphrase extraction

As the number of electronic documents increases rapidly, so does the need for faster techniques to assess their relevance. A summary is a concise representation of the underlying text. Forming an ideal summary requires a full understanding of the document, but achieving such understanding is difficult or impossible for computers. Therefore, the most common technique in automated text summarization is to select important sentences from the original text and present them as the summary. This paper proposes a hybrid clustering method (partitioning and hierarchical) to group many Arabic documents into several clusters. A keyphrase extraction module is then applied to extract important keyphrases from each cluster, which helps identify the most important sentences and find similar sentences based on several similarity algorithms. From each group of similar sentences, one sentence is extracted and the others (i.e., sentences whose similarity exceeds a predefined threshold) are ignored. The model is designed for both single- and multi-document Arabic text summarization. The Recall-Oriented Understudy for Gisting Evaluation (ROUGE) metric is used for the evaluation. The summarization dataset is the Essex Arabic Summaries Corpus, which contains many topic-based articles with multiple human summaries. The model achieved an accuracy of 80% for single-document and 62% for multi-document summarization.
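The redundancy-removal step described in the abstract (keep one sentence from each group of similar sentences) can be illustrated with a minimal sketch. It is not the paper's implementation: the function names `cosine` and `select_sentences`, the whitespace tokenizer, the keyphrase-count scoring, the cosine similarity measure, and the default threshold of 0.7 are all assumptions made for illustration. The abstract does not state which similarity algorithms or threshold value were used, and real Arabic text would additionally need normalization and stemming.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_sentences(sentences, keyphrases, threshold=0.7, max_sentences=3):
    """Greedily pick high-scoring sentences, skipping near-duplicates.

    Assumptions (hypothetical, not from the paper):
    - tokens are obtained by whitespace splitting;
    - a sentence's score is the number of keyphrases it contains;
    - two sentences are "similar" when their cosine exceeds `threshold`.
    """
    vectors = [Counter(s.split()) for s in sentences]
    scores = [sum(1 for k in keyphrases if k in s) for s in sentences]

    # Visit sentences from highest to lowest score; ties keep document order.
    order = sorted(range(len(sentences)), key=lambda i: (-scores[i], i))

    chosen = []
    for i in order:
        if len(chosen) == max_sentences:
            break
        # Keep the sentence only if it is not too similar to any kept one.
        if all(cosine(vectors[i], vectors[j]) <= threshold for j in chosen):
            chosen.append(i)

    # Return the kept sentences in their original document order.
    return [sentences[i] for i in sorted(chosen)]


if __name__ == "__main__":
    docs = [
        "the cat sat on the mat",
        "the cat sat on a mat",
        "dogs bark loudly at night",
    ]
    print(select_sentences(docs, keyphrases=["cat", "dogs"]))
```

With the toy input above, the second sentence is dropped because its cosine similarity to the first (about 0.87) exceeds the threshold, so the output contains only the first and third sentences. The English sentences are used purely for readability.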
