A Silver Standard Arabic Corpus for Segmentation and Validation

Arabic natural language processing applications suffer from a shortage of Arabic corpora in general and of gold standard corpora in particular. A corpus is a collection of written or spoken texts stored on a computer; it is written either in a single language (a monolingual corpus) or in several languages (a multilingual corpus). Corpora are among the most important resources for semantic and syntactic analysis in natural language processing. Our study builds a new silver standard Arabic corpus from a set of morphologically analyzed newspaper articles. It contains 18,167,183 words in total across six categories: Religion, Economy, Culture, Sports, Local News, and International News. It is encoded in UTF-8 and formatted as XML. This silver corpus can be used as an accurate reference for validation and learning in syntactic analysis, mainly for word segmentation and part-of-speech tagging.

Keywords: Arabic Language, Arabic Natural Language Processing, Validation, Information Retrieval, Silver standard corpus.
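As a rough illustration of how a UTF-8, XML-encoded corpus organized by category might be consumed, the following minimal sketch counts words per category. The element and attribute names (`corpus`, `article`, `category`, `word`, `pos`, `seg`) and the sample sentences are assumptions made for illustration only; the abstract does not describe the actual schema.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical layout: tag names, attribute names, and the tag values below
# are assumptions for this sketch, not the corpus's published schema.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<corpus>
  <article category="Sports">
    <word pos="VERB" seg="فاز">فاز</word>
    <word pos="NOUN" seg="ال+فريق">الفريق</word>
    <word pos="NOUN" seg="ب+ال+مباراة">بالمباراة</word>
  </article>
  <article category="Economy">
    <word pos="VERB" seg="ارتفع">ارتفع</word>
    <word pos="NOUN" seg="ال+سعر">السعر</word>
  </article>
</corpus>"""


def words_per_category(xml_bytes: bytes) -> Counter:
    """Count <word> elements under each <article>, grouped by category."""
    root = ET.fromstring(xml_bytes)
    counts = Counter()
    for article in root.iter("article"):
        category = article.get("category", "Unknown")
        counts[category] += sum(1 for _ in article.iter("word"))
    return counts


if __name__ == "__main__":
    # Parse from UTF-8 bytes so the XML declaration's encoding is honored.
    for category, n in words_per_category(SAMPLE.encode("utf-8")).items():
        print(category, n)
```

Under these assumptions, the `seg` and `pos` attributes are where a validation script would read the reference segmentation and part-of-speech tag for each word.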
