Higher Order n-gram Language Models for Arabic Diacritics Restoration

Dynamic programming based Arabic diacritics restoration aims to assign diacritics to Arabic words. The technique is a purely statistical approach and depends only on an Arabic corpus annotated with diacritics. Candidate diacritized word sequences are scored with a statistical n-gram language model, and a dynamic programming algorithm then searches for the most likely sequence under those scores. In previous work [1], the scores were based on a bigram stochastic language model and the decoder was restricted to this model. Using higher order n-gram language models may lead to better diacritization accuracy. In this work, we extend the dynamic programming decoding algorithm to support higher order language models. Preliminary results on a public domain corpus show that dynamic programming decoding based on higher order n-gram models can lead to better results than bigram models.

Index Terms: Arabic diacritics restoration, dynamic programming, statistical language modeling, smoothing
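
The decoding idea summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that each undiacritized word comes with a small list of candidate diacritized forms, that the language model is a plain count-based n-gram model with add-one smoothing (a stand-in for the smoothing techniques the paper has in mind), and that the search state is the last n-1 diacritized words. All function names, the transliterated word forms, and the toy corpus are hypothetical.

```python
import math
from collections import defaultdict


def train_counts(sentences, order):
    """Count n-grams and their (order-1)-word histories in diacritized sentences."""
    ngrams, histories, vocab = defaultdict(int), defaultdict(int), set()
    for sent in sentences:
        padded = ["<s>"] * (order - 1) + list(sent)
        vocab.update(sent)
        for i in range(order - 1, len(padded)):
            hist = tuple(padded[i - order + 1:i])
            ngrams[hist + (padded[i],)] += 1
            histories[hist] += 1
    return ngrams, histories, vocab


def make_logprob(ngrams, histories, vocab):
    """Add-one smoothed log P(word | history); an assumed, simplistic smoother."""
    v = len(vocab) + 1  # one extra slot reserves mass for unseen words

    def logprob(hist, word):
        num = ngrams.get(hist + (word,), 0) + 1
        den = histories.get(hist, 0) + v
        return math.log(num / den)

    return logprob


def decode(words, candidates, logprob, order):
    """Viterbi search; each state is the tuple of the last (order-1) chosen forms."""
    start = ("<s>",) * (order - 1)
    best = {start: (0.0, [])}  # state -> (best log score, best path so far)
    for w in words:
        new_best = {}
        for state, (score, path) in best.items():
            for cand in candidates.get(w, [w]):
                s = score + logprob(state, cand)
                nxt = (state + (cand,))[1:]  # slide the history window
                if nxt not in new_best or s > new_best[nxt][0]:
                    new_best[nxt] = (s, path + [cand])
        best = new_best
    return max(best.values(), key=lambda t: t[0])[1]


# Toy, transliterated data (hypothetical; a real system trains on a diacritized corpus).
ORDER = 3
corpus = [["kataba", "alwaladu", "aldarsa"], ["kutiba", "aldarsu"]]
candidates = {
    "ktb": ["kataba", "kutiba", "kutub"],
    "Alwld": ["alwaladu", "alwalada"],
    "Aldrs": ["aldarsa", "aldarsu"],
}
lm = make_logprob(*train_counts(corpus, ORDER))
result = decode(["ktb", "Alwld", "Aldrs"], candidates, lm, ORDER)
print(result)
```

Moving from a bigram to a higher order model changes only the `order` argument in this sketch: the state grows from one previous word to n-1 previous words, so the number of states per position can grow with the product of the candidate counts of those words.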

[1]  Ya'akov Gal.  An HMM Approach to Vowel Restoration in Arabic and Hebrew , 2002, SEMITIC@ACL.

[2]  Nizar Habash,et al.  Arabic Diacritization through Full Morphological Tagging , 2007, NAACL.

[3]  Mohsen Rashwan,et al.  ARABTALK ® An Implementation for Arabic Text To Speech System , 2009 .

[4]  Yasser Hifny,et al.  Smoothing Techniques for Arabic Diacritics Restoration , 2012 .

[5]  Yasser Hifny,et al.  Restoration of Arabic diacritics using dynamic programming , 2013, 2013 8th International Conference on Computer Engineering & Systems (ICCES).

[6]  Sherif Abdou,et al.  A Stochastic Arabic Diacritizer Based on a Hybrid of Factorized and Unfactorized Textual Features , 2011, IEEE Transactions on Audio, Speech, and Language Processing.

[7]  Ruhi Sarikaya,et al.  Maximum Entropy Based Restoration of Arabic Diacritics , 2006, ACL.

[8]  Khaled Shaalan,et al.  A Hybrid Approach for Building Arabic Diacritizer , 2009, SEMITIC@EACL.

[9]  Mansour M. Alghamdi,et al.  KACST Arabic diacritizer , 2007 .

[10]  Stuart M. Shieber,et al.  Arabic Diacritization Using Weighted Finite-State Transducers , 2005, SEMITIC@ACL.

[11]  Stanley F. Chen,et al.  An Empirical Study of Smoothing Techniques for Language Modeling , 1996, ACL.

[12]  Joshua Goodman,et al.  A bit of progress in language modeling , 2001, Comput. Speech Lang..

[13]  Dimitra Vergyri,et al.  Automatic Diacritization of Arabic for Acoustic Modeling in Speech Recognition , 2004 .

[14]  Lawrence R. Rabiner,et al.  A tutorial on hidden Markov models and selected applications in speech recognition , 1989, Proc. IEEE.