Part-of-Speech Tagging for Arabic Gulf Dialect Using Bi-LSTM

Part-of-speech (POS) tagging is one of the most important areas addressed in natural language processing (NLP). Effective POS taggers exist for many languages, including Arabic. However, POS research for Arabic has focused mainly on Modern Standard Arabic (MSA), with less attention directed towards Dialectal Arabic (DA). MSA is the formal variety, found mainly in news and formal textbooks, while DA is the informal spoken Arabic that varies across regions of the Arab world. DA is heavily used online owing to the spread of social media, which has increased research interest in building NLP tools for DA. Most research on DA focuses on Egyptian and Levantine, while much less attention is given to the Gulf dialect. In this paper, we present a POS tagger for the Arabic Gulf dialect that is more effective than currently available Arabic POS taggers. Our work includes preparing a POS tagging dataset, engineering multiple sets of features, and applying two machine learning methods, namely a Support Vector Machine (SVM) classifier and a bidirectional Long Short-Term Memory (Bi-LSTM) network for sequence modeling. We improve POS tagging accuracy for the Gulf dialect from 75%, obtained with a state-of-the-art MSA POS tagger, to over 91% with a Bi-LSTM labeler.
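To put the reported figures in perspective, the following minimal sketch computes tagging accuracy and the relative error reduction implied by moving from 75% to 91%. It assumes accuracy is measured per token, which is the usual convention for POS tagging but is not stated in the abstract. The function names and the tag labels in the toy example are hypothetical and do not come from the paper.

```python
from typing import Sequence


def token_accuracy(gold: Sequence[Sequence[str]],
                   pred: Sequence[Sequence[str]]) -> float:
    """Fraction of tokens whose predicted tag equals the gold tag.

    Assumption: accuracy is token-level (not sentence-level).
    """
    correct = 0
    total = 0
    for gold_sent, pred_sent in zip(gold, pred):
        if len(gold_sent) != len(pred_sent):
            raise ValueError("gold and predicted sentences must have equal length")
        for g, p in zip(gold_sent, pred_sent):
            correct += int(g == p)
            total += 1
    if total == 0:
        raise ValueError("no tokens to score")
    return correct / total


def relative_error_reduction(baseline_acc: float, new_acc: float) -> float:
    """Share of the baseline's errors that the new tagger removes."""
    baseline_err = 1.0 - baseline_acc
    new_err = 1.0 - new_acc
    return (baseline_err - new_err) / baseline_err


# Toy example with hypothetical tags: 3 of 4 tokens are tagged correctly.
gold_tags = [["NOUN", "VERB", "PREP", "NOUN"]]
pred_tags = [["NOUN", "VERB", "PREP", "ADJ"]]
toy_accuracy = token_accuracy(gold_tags, pred_tags)

# Figures from the abstract: 75% (MSA tagger) versus 91% (Bi-LSTM labeler).
reduction = relative_error_reduction(0.75, 0.91)
```

Under these assumptions, the error rate falls from 25% to 9%, so the Bi-LSTM labeler removes roughly 64% of the baseline tagger's errors. Since the abstract reports "over 91%", this is a lower bound.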
