Statistical Methods for Automatic Diacritization of Arabic Text

This paper addresses the problem of adding diacritics (Tashkeel) to undiacritized Arabic text using statistical language modeling. The approach requires a large corpus of fully diacritized text, from which unigram, bigram, and trigram statistics are extracted for both words and letters. Search algorithms are then used to find the most probable sequence of diacritized words for a given undiacritized word sequence. The undiacritized word sequence is treated as an observation sequence from a hidden Markov model (HMM), in which the hidden states are the possible diacritized forms of the words. The optimal sequence of diacritized words (states) is then obtained efficiently using the Viterbi algorithm. We present an evaluation of the basic algorithm on the text of the Qur’an and discuss various ways of improving the performance of this approach.
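
As a rough illustration of the decoding step, the following is a minimal sketch and not the implementation evaluated in this paper. It assumes a word-level bigram model with add-alpha smoothing, treats only the code points U+064B to U+0652 as diacritics, and leaves words unseen in training undiacritized. All function names (`strip_diacritics`, `train`, `log_prob`, `viterbi`) and the `alpha` parameter are hypothetical. Because each diacritized word maps to exactly one undiacritized form, the emission probability is 1 for matching candidates and 0 otherwise, so the sketch simply restricts the hidden states at each position to the candidates seen in training.

```python
import math
import re
from collections import defaultdict

# Assumption: only fathatan..sukun (U+064B..U+0652) are treated as diacritics.
DIACRITICS = re.compile("[\u064B-\u0652]")

def strip_diacritics(word):
    """Return the undiacritized (observed) form of a diacritized word."""
    return DIACRITICS.sub("", word)

def train(sentences):
    """Collect unigram and bigram counts of diacritized words, and map each
    undiacritized form to the set of diacritized forms seen for it."""
    unigrams = defaultdict(int)
    bigrams = defaultdict(int)
    candidates = defaultdict(set)
    for sentence in sentences:
        prev = "<s>"  # sentence-start marker
        unigrams[prev] += 1
        for word in sentence:
            unigrams[word] += 1
            bigrams[(prev, word)] += 1
            candidates[strip_diacritics(word)].add(word)
            prev = word
    return unigrams, bigrams, candidates

def log_prob(prev, word, unigrams, bigrams, alpha=0.1):
    """Add-alpha smoothed log P(word | prev); alpha is an illustrative choice."""
    vocab = len(unigrams)
    numerator = bigrams.get((prev, word), 0) + alpha
    denominator = unigrams.get(prev, 0) + alpha * vocab
    return math.log(numerator / denominator)

def viterbi(words, unigrams, bigrams, candidates):
    """Return the most probable diacritized sequence for undiacritized words."""
    # best[state] = (log score of the best path ending in state, that path)
    best = {"<s>": (0.0, [])}
    for bare in words:
        # Hidden states: diacritized forms seen for this bare word;
        # an unseen word is passed through undiacritized.
        states = candidates.get(bare) or {bare}
        new_best = {}
        for state in states:
            score, path = max(
                ((s + log_prob(prev, state, unigrams, bigrams), p)
                 for prev, (s, p) in best.items()),
                key=lambda t: t[0],
            )
            new_best[state] = (score, path + [state])
        best = new_best
    return max(best.values(), key=lambda t: t[0])[1]
```

Given a list of diacritized training sentences, `train` produces the counts, and `viterbi` can then be called on a list of undiacritized words. Extending the sketch to trigrams or to letter-level models would enlarge the state space but leave the dynamic-programming structure unchanged.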