Smoothing Techniques for Arabic Diacritics Restoration

An algorithm for restoring Arabic diacritics using a dynamic programming approach was presented in [1]. Candidate diacritized word sequences are scored with a statistical n-gram language model, and a dynamic programming algorithm then searches for the most likely sequence under those scores. Maximum likelihood (ML) estimation of the language model parameters leads to poor diacritization accuracy because of data sparsity. Smoothing addresses this problem by taking some probability mass from the observed n-grams and redistributing it to the unseen n-grams. In this paper, we show that applying smoothing techniques dramatically improves the diacritization accuracy. The interpolated version of the absolute discounting method leads to the best results for different
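
To make the contrast between ML estimation and smoothing concrete, the following is a minimal sketch of interpolated absolute discounting for a bigram model. It is an illustration under stated assumptions, not the configuration used in the paper: the model order (bigram), the discount value of 0.5, the ML unigram used as the lower-order distribution, the function names, and the toy transliterated tokens are all hypothetical.

```python
from collections import Counter


def count_bigrams(sentences):
    """Collect the counts needed for a bigram model (hypothetical helper)."""
    context_counts = Counter()   # c(v): how often v occurs as a history
    bigram_counts = Counter()    # c(v, w)
    unigram_counts = Counter()   # c(w) over all predicted positions
    for sent in sentences:
        tokens = ["<s>"] + list(sent) + ["</s>"]
        for v, w in zip(tokens, tokens[1:]):
            context_counts[v] += 1
            bigram_counts[(v, w)] += 1
            unigram_counts[w] += 1
    return context_counts, bigram_counts, unigram_counts


def ml_prob(v, w, context_counts, bigram_counts):
    """Maximum likelihood estimate: zero for every unseen bigram."""
    if context_counts[v] == 0:
        return 0.0
    return bigram_counts[(v, w)] / context_counts[v]


def abs_discount_prob(v, w, context_counts, bigram_counts, unigram_counts,
                      discount=0.5):
    """Interpolated absolute discounting for P(w | v).

    A fixed discount (assumed 0.5 here) is subtracted from every seen
    bigram count, and the freed mass is redistributed in proportion to
    an ML unigram distribution.
    """
    total = sum(unigram_counts.values())
    p_uni = unigram_counts[w] / total
    c_v = context_counts[v]
    if c_v == 0:
        # Unknown history: fall back to the lower-order distribution.
        return p_uni
    # Number of distinct words observed after the history v.
    distinct = sum(1 for (h, _) in bigram_counts if h == v)
    discounted = max(bigram_counts[(v, w)] - discount, 0.0) / c_v
    backoff_weight = discount * distinct / c_v
    return discounted + backoff_weight * p_uni


# Toy corpus of placeholder diacritized tokens (illustrative only).
toy_sentences = [
    ["kataba", "Alwaladu"],
    ["kataba", "Aldarsa"],
    ["qaraOa", "Alwaladu"],
]
ctx, bi, uni = count_bigrams(toy_sentences)

# The bigram ("qaraOa", "Aldarsa") never occurs in the toy corpus.
p_ml = ml_prob("qaraOa", "Aldarsa", ctx, bi)
p_ad = abs_discount_prob("qaraOa", "Aldarsa", ctx, bi, uni)
print(p_ml, p_ad)
```

In this sketch the unseen bigram receives zero probability under ML estimation, which would rule out any candidate sequence containing it, while the discounted estimate stays positive and the distribution for each history still sums to one.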

[1]  Yasser Hifny,et al.  Restoration of Arabic diacritics using dynamic programming , 2013, 2013 8th International Conference on Computer Engineering & Systems (ICCES).

[2]  Mansour M. Alghamdi,et al.  KACST Arabic diacritizer , 2007 .

[3]  I. Good THE POPULATION FREQUENCIES OF SPECIES AND THE ESTIMATION OF POPULATION PARAMETERS , 1953 .

[4]  Mohsen Rashwan,et al.  ARABTALK ® An Implementation for Arabic Text To Speech System , 2009 .

[5]  Stuart M. Shieber,et al.  Arabic Diacritization Using Weighted Finite-State Transducers , 2005, SEMITIC@ACL.

[6]  Stanley F. Chen,et al.  An Empirical Study of Smoothing Techniques for Language Modeling , 1996, ACL.

[7]  Slava M. Katz,et al.  Estimation of probabilities from sparse data for the language model component of a speech recognizer , 1987, IEEE Trans. Acoust. Speech Signal Process..

[8]  Ya'akov Gal An HMM Approach to Vowel Restoration in Arabic and Hebrew , 2002, SEMITIC@ACL.

[9]  Joshua Goodman,et al.  A bit of progress in language modeling , 2001, Comput. Speech Lang..

[10]  Hermann Ney,et al.  On structuring probabilistic dependences in stochastic language modelling , 1994, Comput. Speech Lang..

[11]  Sherif Abdou,et al.  A Stochastic Arabic Diacritizer Based on a Hybrid of Factorized and Unfactorized Textual Features , 2011, IEEE Transactions on Audio, Speech, and Language Processing.

[12]  Nizar Habash,et al.  Arabic Diacritization through Full Morphological Tagging , 2007, NAACL.

[13]  Ruhi Sarikaya,et al.  Maximum Entropy Based Restoration of Arabic Diacritics , 2006, ACL.

[14]  Khaled Shaalan,et al.  A Hybrid Approach for Building Arabic Diacritizer , 2009, SEMITIC@EACL.

[15]  Dimitra Vergyri,et al.  Automatic Diacritization of Arabic for Acoustic Modeling in Speech Recognition , 2004 .