Arabic Diacritization with Gated Recurrent Unit

Arabic and similar languages rely on diacritics to determine how each word is pronounced and to identify each part of speech correctly. Diacritization is therefore a crucial step when performing Natural Language Processing (NLP) on Arabic. In this paper we use a gated recurrent unit (GRU) network as a language-independent framework for Arabic diacritization. The end-to-end approach allows the system to be trained exclusively on vocalized text, without external resources. We evaluate the system against state-of-the-art results reported in the literature. We show that it achieves state-of-the-art results while reducing training and testing time.
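To illustrate what training exclusively on vocalized text can mean in practice, the following is a minimal sketch of how a vocalized string might be split into bare characters and per-character diacritic labels, which a sequence model such as a GRU could then learn to predict. The names `DIACRITICS` and `split_vocalized` are hypothetical, and this preprocessing is an assumption for illustration, not the authors' published pipeline.

```python
# Assumption: labels are built from the eight common Arabic diacritics,
# Unicode U+064B..U+0652 (tanween forms, fatha, damma, kasra, shadda, sukun).
DIACRITICS = {chr(cp) for cp in range(0x064B, 0x0653)}


def split_vocalized(text):
    """Split vocalized text into base characters and diacritic labels.

    Returns (chars, labels), two lists of equal length. labels[i] holds
    the diacritics that follow chars[i], or "" when there are none.
    """
    chars, labels = [], []
    for ch in text:
        if ch in DIACRITICS:
            if chars:
                # Attach the mark to the preceding base character.
                labels[-1] += ch
            # A mark with no preceding base character is dropped.
        else:
            chars.append(ch)
            labels.append("")
    return chars, labels


# Example: the vocalized word "kataba" (kaf, ta, ba, each followed by fatha).
word = "\u0643\u064E\u062A\u064E\u0628\u064E"
chars, labels = split_vocalized(word)
```

Here `chars` would serve as the network input and `labels` as the target sequence, so no lexicon or morphological analyzer is needed to build training pairs.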
