Exploiting Arabic Diacritization for High Quality Automatic Annotation

We present a novel technique for Arabic morphological annotation. The technique uses diacritization to produce morphological annotations whose quality is comparable to that of human annotators. Although Arabic text is generally written without diacritics, diacritized text is already available in large corpora spanning several genres. Furthermore, diacritization can be produced at low cost for new text, because it requires no specialized training beyond what educated Arabic typists already know. The basic approach is to enrich the input to a state-of-the-art Arabic morphological analyzer with word diacritics (full or partial) in order to improve its performance. When applied to fully diacritized text, our approach produces annotations with an accuracy of over 97% on lemma, part-of-speech, and tokenization combined.
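
As an illustration only, the idea of using full or partial diacritics to narrow down a word's candidate analyses can be sketched as a simple compatibility filter. This is a minimal sketch under stated assumptions, not the system described above: the function names, the `ANALYSES` list, its field names, and the fallback policy are all hypothetical, and a real analyzer would also handle orthographic normalization and rank the surviving candidates in context.

```python
# Hypothetical sketch: keep only the candidate analyses whose fully
# diacritized form agrees with the diacritics present on the input word.

# Arabic combining marks U+064B..U+0652 (tanween, short vowels, shadda, sukun).
DIACRITICS = frozenset(chr(c) for c in range(0x064B, 0x0653))


def split_diacritics(word):
    """Return a list of (base_letter, frozenset_of_diacritics) pairs."""
    units = []
    for ch in word:
        if ch in DIACRITICS and units:
            base, marks = units[-1]
            units[-1] = (base, marks | {ch})
        else:
            units.append((ch, frozenset()))
    return units


def is_compatible(observed, candidate):
    """True if every diacritic on `observed` also appears on `candidate`
    at the same letter, and the base letters are identical."""
    obs = split_diacritics(observed)
    cand = split_diacritics(candidate)
    if len(obs) != len(cand):
        return False
    return all(ob == cb and om <= cm
               for (ob, om), (cb, cm) in zip(obs, cand))


def filter_analyses(observed, analyses):
    """Keep compatible analyses; fall back to all of them if none match
    (an assumed policy for noisy diacritics, not taken from the paper)."""
    kept = [a for a in analyses if is_compatible(observed, a["diac"])]
    return kept or list(analyses)


# Toy candidate list for the undiacritized word "ktb" (illustrative data).
ANALYSES = [
    {"diac": "\u0643\u064E\u062A\u064E\u0628\u064E", "pos": "verb", "gloss": "he wrote"},
    {"diac": "\u0643\u064F\u062A\u064F\u0628", "pos": "noun", "gloss": "books"},
    {"diac": "\u0643\u064F\u062A\u0650\u0628\u064E", "pos": "verb", "gloss": "was written"},
]

if __name__ == "__main__":
    partial = "\u0643\u064F\u062A\u0628"  # damma on the first letter only
    for analysis in filter_analyses(partial, ANALYSES):
        print(analysis["pos"], analysis["gloss"])
```

In this toy setting, an undiacritized input leaves all three candidates in play, a single damma on the first letter rules out the active verb reading, and a fully diacritized input leaves exactly one candidate.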
