Part of speech tagging for Arabic

This paper investigates part of speech (POS) tagging for Arabic as it occurs naturally, i.e. as unvocalized text (without diacritics). We also do not assume any prior tokenization, although previous work has used tokenization as a basis for POS tagging. Arabic is a morphologically complex language, i.e. there is a high number of inflections per word, and its tagset is larger than the typical tagset for English. Both factors, the second being partly dependent on the first, increase the number of word/tag combinations for which the POS tagger needs to find estimates, and thus contribute to data sparseness. We present a novel approach to Arabic POS tagging that does not require any pre-processing such as segmentation or tokenization: whole word tagging. In this approach, the complete word is assigned a complex POS tag that includes morphological information. In a competing approach, we investigate whether segmentation and vocalization alleviate data sparseness and ambiguity in POS tagging. In the segmentation-based approach, we first automatically segment words and then POS tag the segments. The complex tagset encompasses 993 POS tags, whereas the segment-based tagset encompasses only 139 tags. However, segments are also more ambiguous, so there are more possible combinations of segment tags. In realistic situations, in which we have no information about segmentation or vocalization, whole word tagging reaches the highest accuracy, 94.74%. If gold standard segmentation or vocalization is available, including this information improves POS tagging accuracy. However, while our automatic segmentation and vocalization modules reach state-of-the-art performance, they are not reliable enough for POS tagging, and using their output actually impairs POS tagging performance. Finally, we investigate whether reducing the complex tagset to the Extra-Reduced Tagset, as suggested by Habash and Rambow (Habash, N., and Rambow, O. 2005. Arabic tokenization, part-of-speech tagging and morphological disambiguation in one fell swoop. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL), Ann Arbor, MI, USA, pp. 573-80), alleviates the data sparseness problem. While POS tagging accuracy increases with the smaller tagset, a closer look shows that tagging with the complex tagset and then converting the resulting annotation to the smaller tagset yields higher accuracy than tagging with the smaller tagset directly.
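The last comparison, tagging with the complex tagset and then converting the output to a smaller tagset before scoring, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the tag strings only imitate the general shape of complex morphological tags, and `REDUCED` and `reduce_tag` are a hypothetical toy mapping, not the Extra-Reduced Tagset or the conversion used in the paper.

```python
from typing import Sequence

# Hypothetical toy mapping from a core category inside a complex tag
# to a coarse tag. This is NOT the Extra-Reduced Tagset.
REDUCED = {"NOUN": "NN", "ADJ": "JJ", "PV": "VBD", "IV": "VBP", "PREP": "IN"}


def reduce_tag(complex_tag: str) -> str:
    """Map a complex tag such as 'DET+NOUN+CASE_DEF_NOM' to a coarse tag.

    The first '+'-separated component found in REDUCED decides the result;
    all other components (determiner, case, suffixes) are ignored.
    """
    for part in complex_tag.split("+"):
        if part in REDUCED:
            return REDUCED[part]
    return "OTHER"


def accuracy(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of positions where the predicted tag equals the gold tag."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


# Illustrative (made-up) gold and predicted complex tags for four words.
gold_complex = [
    "DET+NOUN+CASE_DEF_NOM",
    "PV+PVSUFF_SUBJ:3MS",
    "PREP",
    "DET+ADJ+CASE_DEF_GEN",
]
pred_complex = [
    "DET+NOUN+CASE_DEF_ACC",   # wrong case only
    "PV+PVSUFF_SUBJ:3MS",
    "PREP",
    "DET+NOUN+CASE_DEF_GEN",   # wrong core category
]

# Score on the complex tagset, then convert both sides and score again.
complex_acc = accuracy(pred_complex, gold_complex)
reduced_acc = accuracy(
    [reduce_tag(t) for t in pred_complex],
    [reduce_tag(t) for t in gold_complex],
)
print(complex_acc, reduced_acc)
```

On this toy data the score rises from 0.5 to 0.75 after conversion, because the case error on the first word disappears once morphological detail is dropped, while the category error on the last word remains. The sketch shows only the scoring step; it says nothing about why a tagger trained on complex tags would make fewer coarse errors than one trained on the small tagset.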
