Morphological Segmentation and Part-of-Speech Tagging for the Arabic Heritage

We annotate 60,000 words of Classical Arabic (CA), covering topics in philosophy, religion, literature, and law, with fine-grained, segment-based morphological descriptions. We use these annotations to build a morphological segmenter and a part-of-speech (POS) tagger for CA. Using character-level classification with features drawn from the word and its lexical context, the segmenter achieves a word accuracy of 96.8%; its main source of error is a high rate of out-of-vocabulary words. A token-based POS tagger achieves an accuracy of 96.22% overall and 97.72% on known tokens, despite the small size of the corpus. An error analysis shows that most tagging errors result from segmentation errors and that quality improves as more data is added. The morphological segmenter and tagger have a wide range of potential applications in processing CA, a low-resource variety of the language.
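As an illustration only, character-level segmentation can be framed as labeling each character with whether it begins a new segment. The sketch below is not the authors' implementation: the B/I label scheme, the feature names, the window size, and the Buckwalter-transliterated example word are all assumptions chosen for the example.

```python
# Hypothetical sketch of character-level segmentation as classification.
# Label scheme (assumed): "B" = character begins a segment, "I" = inside one.

def char_features(word, i, prev_word, next_word, window=2):
    """Build a feature dict for character i from the word and its lexical context."""
    feats = {
        "char": word[i],
        "index": i,
        "word": word,
        "prev_word": prev_word,   # lexical context to the left
        "next_word": next_word,   # lexical context to the right
    }
    for k in range(1, window + 1):
        # Pad with boundary symbols when the window runs off the word.
        feats[f"char-{k}"] = word[i - k] if i - k >= 0 else "<s>"
        feats[f"char+{k}"] = word[i + k] if i + k < len(word) else "</s>"
    return feats


def labels_from_segments(segments):
    """Convert a gold segmentation into one B/I label per character."""
    labels = []
    for seg in segments:
        labels.extend(["B"] + ["I"] * (len(seg) - 1))
    return labels


def segments_from_labels(word, labels):
    """Rebuild segments from per-character B/I labels."""
    segments = []
    for ch, lab in zip(word, labels):
        if lab == "B" or not segments:
            segments.append(ch)
        else:
            segments[-1] += ch
    return segments


def word_accuracy(gold, pred):
    """Share of words whose full predicted segmentation matches the gold one."""
    correct = sum(g == p for g, p in zip(gold, pred))
    return correct / len(gold)


# Example word in Buckwalter transliteration: w+b+Al+ktAb ("and with the book").
gold_segments = ["w", "b", "Al", "ktAb"]
word = "".join(gold_segments)
labels = labels_from_segments(gold_segments)
```

Any classifier that accepts feature dictionaries could be trained on the output of `char_features` paired with `labels`. Scoring whole words, as `word_accuracy` does, means a single mislabeled character counts the entire word as wrong.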
