An Arabic Multi-source News Corpus: Experimenting on Single-document Extractive Summarization

Automatic text summarization is an important task in several areas of natural language processing, such as information retrieval. It is the process of automatically generating a condensed representation of a text, and it offers one answer to the problem of information overload: given the large amount of information available on the Internet, presenting a document through its summary helps users find the most relevant results of a search. In this paper we present ANT corpus v2.1, a new, freely available, structured Arabic corpus in the standard XML TREC format, collected using RSS feeds from different news sources. The corpus is useful for multiple text mining purposes such as generic text summarization, clustering, and classification. We test it on unsupervised single-document extractive summarization using statistical and graph-based language-independent summarizers, namely LexRank, TextRank, Luhn and LSA. We investigate the sensitivity of the summarization process to the stemming and stop-word removal steps. We evaluate the summarizers' performance by comparing the extracted text fragments to the abstracts available in ANT corpus v2.1 using the ROUGE and BLEU metrics. Experimental results show that the LexRank summarizer achieves the best ROUGE scores in the stop-word removal scenario.
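The evaluation step described above, comparing an extracted summary against a reference abstract with and without stop-word removal, can be illustrated with a minimal sketch of a ROUGE-1 style unigram overlap. This is not the ROUGE package used in the paper: the function names `tokenize` and `rouge_1`, the whitespace tokenization, and the English toy sentences are assumptions made for illustration only.

```python
from collections import Counter


def tokenize(text, stop_words=frozenset()):
    """Split on whitespace and drop any token found in stop_words.

    Whitespace splitting is a simplifying assumption; real Arabic
    preprocessing would also normalize and possibly stem tokens.
    """
    return [tok for tok in text.split() if tok not in stop_words]


def rouge_1(candidate, reference, stop_words=frozenset()):
    """Unigram-overlap recall, precision and F1 (ROUGE-1 style)."""
    cand = Counter(tokenize(candidate, stop_words))
    ref = Counter(tokenize(reference, stop_words))
    # Clipped overlap: each unigram counts at most as often as it
    # appears in both the candidate and the reference.
    overlap = sum((cand & ref).values())
    recall = overlap / sum(ref.values()) if ref else 0.0
    precision = overlap / sum(cand.values()) if cand else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"recall": recall, "precision": precision, "f1": f1}


if __name__ == "__main__":
    extracted = "the cat sat on the mat"   # hypothetical system summary
    abstract = "the cat lay on the mat"    # hypothetical reference abstract
    print(rouge_1(extracted, abstract))
    print(rouge_1(extracted, abstract, stop_words=frozenset({"the", "on"})))
```

In this toy pair, removing the stop words lowers the score because the shared function words no longer count as matches, which is the kind of sensitivity to preprocessing that the experiments examine.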
