Improving Arabic Diacritization through Syntactic Analysis

We present an approach to Arabic automatic diacritization that integrates syntactic analysis with morphological tagging by improving the prediction of case and state features. Our best system increases word diacritization accuracy by 2.5% absolute on all words, and by 5.2% absolute on nominals, over a state-of-the-art baseline. Similar increases are shown on the full morphological analysis choice.
