Automatic diacritization of Arabic text using recurrent neural networks

This paper presents a sequence transcription approach for the automatic diacritization of Arabic text. A recurrent neural network is trained to transcribe undiacritized Arabic text into fully diacritized sentences. We use a deep bidirectional long short-term memory network that builds high-level linguistic abstractions of text and exploits long-range context in both input directions. Unlike previous approaches, this approach performs no lexical, morphological, or syntactic analysis on the data before it is processed by the network. Nonetheless, when the network output is post-processed with our error correction techniques, the system achieves state-of-the-art performance, yielding average diacritic and word error rates of 2.09% and 5.82%, respectively, on samples from 11 books. For the LDC ATB3 benchmark, this approach reduces the diacritic error rate by 25%, the word error rate by 20%, and the last-letter diacritization error rate by 33% relative to the best published results.
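
The two headline metrics can be illustrated with a minimal sketch. It assumes that the diacritic error rate is the fraction of letters whose attached diacritics differ from the reference, and that the word error rate is the fraction of words containing at least one such letter. The abstract does not spell out the exact counting rules, so these definitions, along with the names `DIACRITICS`, `split_letters`, and `error_rates`, are illustrative assumptions rather than the authors' evaluation code.

```python
# Arabic combining marks U+064B..U+0652 (tanween, short vowels, shadda, sukun).
# Assumption: only these eight marks are treated as diacritics.
DIACRITICS = {chr(c) for c in range(0x064B, 0x0653)}


def split_letters(word):
    """Return a list of (letter, marks) pairs for one word."""
    pairs = []
    for ch in word:
        if ch in DIACRITICS:
            if pairs:  # attach the mark to the preceding letter
                letter, marks = pairs[-1]
                pairs[-1] = (letter, marks + ch)
            # a mark with no preceding letter is ignored
        else:
            pairs.append((ch, ""))
    return pairs


def error_rates(reference, hypothesis):
    """Return (diacritic_error_rate, word_error_rate) as fractions in [0, 1].

    Both strings must contain the same words with the same base letters;
    only the diacritics may differ.
    """
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words or len(ref_words) != len(hyp_words):
        raise ValueError("inputs must be non-empty and have equal word counts")

    letters = letter_errors = word_errors = 0
    for ref_word, hyp_word in zip(ref_words, hyp_words):
        ref_pairs = split_letters(ref_word)
        hyp_pairs = split_letters(hyp_word)
        if [l for l, _ in ref_pairs] != [l for l, _ in hyp_pairs]:
            raise ValueError("base letters differ between reference and hypothesis")
        # Sort marks so that shadda+vowel and vowel+shadda compare equal.
        wrong = sum(
            sorted(ref_marks) != sorted(hyp_marks)
            for (_, ref_marks), (_, hyp_marks) in zip(ref_pairs, hyp_pairs)
        )
        letters += len(ref_pairs)
        letter_errors += wrong
        word_errors += 1 if wrong else 0

    return letter_errors / letters, word_errors / len(ref_words)
```

Under these assumptions, a two-word sentence of six letters with a single wrong case ending yields a diacritic error rate of 1/6 and a word error rate of 1/2, which shows why the word error rate is always at least as large as the diacritic error rate for the same output.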
