ASMA: A System for Automatic Segmentation and Morpho-Syntactic Disambiguation of Modern Standard Arabic

In this paper, we present ASMA, a fast and efficient system for automatic segmentation and fine-grained part-of-speech (POS) tagging of Modern Standard Arabic (MSA). ASMA segments both agglutinative and inflectional morphological boundaries within a word. We compare ASMA to two state-of-the-art suites of MSA tools: AMIRA 2.1 (Diab et al., 2007; Diab, 2009) and MADA+TOKAN 3.2 (Habash et al., 2009). ASMA achieves results comparable to the state-of-the-art performance of these two systems: an accuracy of 98.34% for segmentation, 96.26% for POS tagging with a rich tagset, and 97.59% for POS tagging with an extremely reduced tagset.

[1] David W. Aha, et al. Instance-Based Learning Algorithms, 1991, Machine Learning.

[2] Treebank Penn, et al. Linguistic Data Consortium, 1999.

[3] Kadri Hacioglu, et al. Automatic Processing of Modern Standard Arabic Text, 2007.

[4] Yannick Versley, et al. Statistical Parsing of Morphologically Rich Languages (SPMRL) What, How and Whither, 2010, SPMRL@NAACL-HLT.

[5] Beatrice Santorini, et al. Part-of-Speech Tagging Guidelines for the Penn Treebank Project (3rd Revision), 1990.

[6] Nizar Habash, et al. Arabic Morphological Tagging, Diacritization, and Lemmatization Using Lexeme Models and Feature Ranking, 2008, ACL.

[7] Jan Hajic, et al. Morphological Tagging: Data vs. Dictionaries, 2000, ANLP.

[8] Walter Daelemans, et al. Memory-Based Language Processing, 2009, Studies in Natural Language Processing.

[9] Walter Daelemans, et al. MBT: A Memory-Based Part of Speech Tagger-Generator, 1996, VLC@COLING.

[10] M. Maamouri, et al. The Penn Arabic Treebank: Building a Large-Scale Annotated Arabic Corpus, 2004.

[11] Sandra Kübler, et al. Is Arabic Part of Speech Tagging Feasible Without Word Segmentation?, 2010, NAACL.

[12] Nizar Habash, et al. CATiB: The Columbia Arabic Treebank, 2009, ACL.

[13] Nizar Habash, et al. MADA+TOKAN: A Toolkit for Arabic Tokenization, Diacritization, Morphological Disambiguation, POS Tagging, Stemming and Lemmatization, 2009.

[14] Mitchell P. Marcus, et al. Text Chunking using Transformation-Based Learning, 1995, VLC@ACL.

[15] Mona T. Diab, et al. Second Generation AMIRA Tools for Arabic Processing: Fast and Robust Tokenization, POS Tagging, and Base Phrase Chunking, 2009.

[16] Walter Daelemans, et al. Forgetting Exceptions is Harmful in Language Learning, 1998, Machine Learning.

[17] Beatrice Santorini. Part-of-Speech Tagging Guidelines for the Penn Treebank Project, 1990.

[18] Nizar Habash, et al. Arabic Tokenization, Part-of-Speech Tagging and Morphological Disambiguation in One Fell Swoop, 2005, ACL.

[19] Sandra Kübler, et al. Part of Speech Tagging for Arabic, 2011, Natural Language Engineering.

[20] Daniel Jurafsky, et al. Automatic Tagging of Arabic Text: From Raw Text to Base Phrase Chunks, 2004, NAACL.