Paper information - The International Corpus of Arabic: Compilation, Analysis and Evaluation


This paper focuses on a project for building the first International Corpus of Arabic (ICA). The corpus is planned to contain 100 million analyzed tokens, with an interface that allows users to interact with the corpus data in a number of ways [ICA website]. ICA is a representative corpus of Arabic that was initiated in 2006; it is intended to cover Modern Standard Arabic (MSA) as used all over the Arab world. ICA has been analyzed by the Bibliotheca Alexandrina Morphological Analysis Enhancer (BAMAE), which is based on the Buckwalter Arabic Morphological Analyzer (BAMA). Precision and recall are the measures used to evaluate the BAMAE system. At this point, precision ranges from 92% to 95% and recall from 89% to 92%, depending on the number of qualifiers retrieved for every word. These percentages are expected to rise as the improvements are implemented and larger amounts of data are processed.
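For readers unfamiliar with the two evaluation measures named in the abstract, the following is a minimal sketch of the standard set-based definitions of precision and recall. It is not the BAMAE evaluation procedure, which the abstract does not describe; the function name `precision_recall` and the analysis labels are hypothetical and serve only as illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of analyses.

    precision = correct retrieved items / all retrieved items
    recall    = correct retrieved items / all relevant (gold) items
    """
    retrieved, relevant = set(retrieved), set(relevant)
    correct = len(retrieved & relevant)
    precision = correct / len(retrieved) if retrieved else 0.0
    recall = correct / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical labels: analyses proposed for one word vs. gold-standard analyses.
proposed = {"NOUN+CASE_NOM", "VERB_PERFECT", "ADJ"}
gold = {"NOUN+CASE_NOM", "VERB_PERFECT", "PREP", "NOUN_PROP"}

p, r = precision_recall(proposed, gold)
print(f"precision={p:.2f} recall={r:.2f}")
```

Under these definitions, retrieving more candidate analyses per word tends to raise recall while lowering precision, which is one plausible reading of why the reported figures vary with the number of qualifiers retrieved.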

Magdy Nagi | Sameh Alansary
