Neural Arabic Text Diacritization: State of the Art Results and a Novel Approach for Machine Translation

In this work, we present several deep learning models for the automatic diacritization of Arabic text. Our models are built using two main approaches, viz. Feed-Forward Neural Network (FFNN) and Recurrent Neural Network (RNN), with several enhancements such as 100-hot encoding, embeddings, Conditional Random Field (CRF) and Block-Normalized Gradient (BNG). The models are tested on the only freely available benchmark dataset, and the results show that they perform better than or on par with other models, even though those models, unlike ours, require language-dependent post-processing steps. Moreover, we show that diacritics in Arabic can be used to improve models for other NLP tasks such as Machine Translation (MT) by proposing the Translation over Diacritization (ToD) approach.
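
Diacritization is commonly framed as per-character labeling: each base character of the undiacritized text receives zero or more diacritics as its label. The following Python sketch illustrates that framing only and is not taken from the authors' code. The function name `split_diacritics` is hypothetical, and the label set is assumed to be the eight basic Arabic diacritic code points U+064B to U+0652, which may differ from the exact label inventory used in this work.

```python
# Minimal sketch (assumption: labels are built from the eight basic Arabic
# diacritics, U+064B..U+0652: tanween, short vowels, shadda, sukun).
DIACRITICS = {chr(cp) for cp in range(0x064B, 0x0653)}


def split_diacritics(text):
    """Split diacritized text into base characters and per-character labels.

    Returns (chars, labels), where labels[i] is the (possibly empty) string
    of diacritics that follow chars[i] in the input.
    """
    chars, labels = [], []
    for ch in text:
        if ch in DIACRITICS:
            # Attach the diacritic to the preceding base character;
            # a diacritic with no preceding base character is dropped.
            if chars:
                labels[-1] += ch
            continue
        chars.append(ch)
        labels.append("")
    return chars, labels


# Example: the word "kataba" (kaf, ta, ba, each followed by a fatha).
chars, labels = split_diacritics("\u0643\u064E\u062A\u064E\u0628\u064E")
undiacritized = "".join(chars)  # model input
```

Under this framing, a model such as the FFNN or RNN mentioned above would take `undiacritized` as input and predict one entry of `labels` per character.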
