A hybrid Arabic POS tagging for simple and compound morphosyntactic tags

Abstract The objective of this work is to develop a POS tagger for the Arabic language. The tagger uses a very rich tag set that gives syntactic information about the proclitics attached to words. The study employs a probabilistic model and a morphological analyzer to identify the right tag in context. Most published research on probabilistic tagging uses only a training corpus to find the probable tags for each word, and this sometimes degrades performance. In this paper, we propose a method that takes into account tags that are not included in the training data. These tags are proposed by the Alkhalil_Morpho_Sys analyzer (Bebah et al. 2011). We show that this consideration significantly increases the accuracy of the morphosyntactic analysis. In addition, the adopted tag set contains compound tags that make it possible to analyze the proclitics attached to words.
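The idea of combining corpus statistics with analyzer-proposed tags can be illustrated with a minimal sketch. It assumes a bigram hidden Markov model decoded with the Viterbi algorithm and additive smoothing; the abstract only says "a probabilistic model", so these choices are assumptions, not the authors' stated method. The function names, the transliterated toy words, the tag labels and `toy_analyzer` (a stand-in for a morphological analyzer such as Alkhalil_Morpho_Sys, whose real interface is not shown here) are all hypothetical.

```python
import math
from collections import defaultdict


def train(tagged_sentences):
    """Count tag bigrams and (tag, word) pairs from a tagged corpus."""
    trans = defaultdict(lambda: defaultdict(int))
    emit = defaultdict(lambda: defaultdict(int))
    for sentence in tagged_sentences:
        prev = "<s>"  # sentence-start pseudo tag
        for word, tag in sentence:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            prev = tag
    return trans, emit


def smoothed_log(counts, key, n_outcomes, alpha=0.1):
    """Additive smoothing so unseen tags and words keep a non-zero probability."""
    total = sum(counts.values())
    return math.log((counts.get(key, 0) + alpha) / (total + alpha * n_outcomes))


def candidate_tags(word, emit, analyzer):
    """Tags seen with the word in training, plus tags the analyzer proposes."""
    seen = {tag for tag, words in emit.items() if word in words}
    proposed = set(analyzer(word))
    # Fall back to every training tag if neither source knows the word.
    return (seen | proposed) or set(emit)


def viterbi(words, trans, emit, analyzer):
    """Return the most probable tag sequence over the candidate lattice."""
    if not words:
        return []
    lattice = [candidate_tags(w, emit, analyzer) for w in words]
    vocab = {w for ws in emit.values() for w in ws} | set(words)
    all_tags = set(emit).union(*lattice)
    n_tags, n_words = len(all_tags), len(vocab)
    best = {"<s>": (0.0, [])}  # tag -> (log score, best path ending in tag)
    for word, candidates in zip(words, lattice):
        step = {}
        for tag in candidates:
            e = smoothed_log(emit.get(tag, {}), word, n_words)
            score, path = max(
                ((s + smoothed_log(trans.get(prev, {}), tag, n_tags), p)
                 for prev, (s, p) in best.items()),
                key=lambda item: item[0],
            )
            step[tag] = (score + e, path + [tag])
        best = step
    return max(best.values(), key=lambda item: item[0])[1]


# Hypothetical toy corpus (transliterated) with simple tags only.
corpus = [
    [("kataba", "VERB"), ("alwaladu", "NOUN")],
    [("qaraa", "VERB"), ("alkitaba", "NOUN")],
    [("alwaladu", "NOUN"), ("kataba", "VERB")],
]
trans, emit = train(corpus)


def toy_analyzer(word):
    """Hypothetical analyzer: proposes a compound tag absent from the corpus."""
    return {"walwaladu": {"CONJ+NOUN"}}.get(word, set())


tags = viterbi(["kataba", "walwaladu"], trans, emit, toy_analyzer)
```

In this toy setting the word "walwaladu" never occurs in the training data, and the compound tag "CONJ+NOUN" is not in the corpus tag set at all. The analyzer still places it in the lattice, and smoothing gives it a usable score, which is the behaviour the abstract describes for tags missing from the training data.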
