Restoration of Arabic diacritics using dynamic programming

Arabic script can be written with or without diacritics. In ordinary usage (e.g., Arabic newspapers), text is written without them. When diacritics are present, the script provides enough information to determine the correct pronunciation and meaning of each word. Assigning the correct diacritics to Arabic words is a complex task involving morphological, syntactic, and semantic processing. The goal of this research is to develop an automatic system that assigns diacritics to Arabic words. The presented technique is a purely statistical approach that depends only on an Arabic corpus annotated with diacritics. In this paper, we present an algorithm that restores Arabic diacritics using dynamic programming. Candidate diacritized word sequences are scored with a statistical n-gram language model, and a dynamic programming algorithm then searches for the most likely sequence. When case endings are ignored (i.e., the diacritic mark on the last letter of each word), preliminary results on a public-domain corpus show that the algorithm can lead to good results.
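
The scoring-and-search idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the bigram order, the add-one smoothing, the treatment of unseen words, and all function names (`strip_diacritics`, `train`, `log_prob`, `restore`) are assumptions made for illustration. Each undiacritized word is mapped to the diacritized forms seen in the training corpus, and a Viterbi-style dynamic program keeps the best-scoring path ending in each candidate.

```python
import math
from collections import defaultdict

START = "<s>"  # hypothetical sentence-start token


def strip_diacritics(word):
    """Remove the basic Arabic diacritic marks (U+064B to U+0652)."""
    return "".join(ch for ch in word if not ("\u064B" <= ch <= "\u0652"))


def train(corpus):
    """Collect candidate forms plus unigram and bigram counts.

    `corpus` is a list of sentences, each a list of diacritized words.
    """
    candidates = defaultdict(set)
    unigrams = defaultdict(int)
    bigrams = defaultdict(int)
    for sentence in corpus:
        prev = START
        unigrams[prev] += 1
        for word in sentence:
            candidates[strip_diacritics(word)].add(word)
            unigrams[word] += 1
            bigrams[(prev, word)] += 1
            prev = word
    return candidates, unigrams, bigrams


def log_prob(prev, word, unigrams, bigrams, vocab_size):
    """Bigram log-probability with add-one smoothing (an assumed choice)."""
    numerator = bigrams.get((prev, word), 0) + 1
    denominator = unigrams.get(prev, 0) + vocab_size
    return math.log(numerator / denominator)


def restore(words, candidates, unigrams, bigrams):
    """Return the most likely diacritized sequence for undiacritized `words`."""
    vocab_size = len(unigrams)
    # best[form] = (score of the best path ending in `form`, that path)
    best = {START: (0.0, [])}
    for word in words:
        # Assumption: words never seen in training pass through unchanged.
        options = candidates.get(word, {word})
        new_best = {}
        for cand in options:
            score, path = max(
                (
                    (s + log_prob(prev, cand, unigrams, bigrams, vocab_size), p)
                    for prev, (s, p) in best.items()
                ),
                key=lambda item: item[0],
            )
            new_best[cand] = (score, path + [cand])
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]
```

With a bigram model, the search cost grows linearly with sentence length and quadratically with the number of candidates per word, rather than exponentially with the number of possible sequences. A higher-order n-gram model would need a larger state (the previous n-1 words) in `best`.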
