The problem of vocalization, or diacritization, is essential to many tasks in Arabic NLP. Arabic is generally written without short vowels, so a single written form can have several pronunciations, each carrying its own meaning(s). In the experiments reported here, we define vocalization as a classification problem: for each character in the unvocalized word, we decide whether it is followed by a short vowel. We investigate the importance of different types of context. Our results show that combining memory-based learning with only word-internal context leads to a word error rate of 6.64%. When lexical context is added, the results slowly deteriorate.
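To make the task formulation concrete, the sketch below casts vocalization as per-character classification with a memory-based (nearest-neighbour) learner, in the spirit of TiMBL's IB1 with an overlap metric. Everything here is a simplified illustration: the `char_windows` helper, the window size, and the toy transliterated training data are assumptions, not the paper's actual features or data.

```python
# Illustrative sketch only: per-character classification for vocalization.
# The feature set (a fixed character window) and the toy data are
# assumptions; the actual experiments use TiMBL with richer contexts.

def char_windows(word, size=2):
    """For each character, build a window of `size` characters of
    left and right word-internal context (padded with '_')."""
    pad = "_" * size
    padded = pad + word + pad
    return [tuple(padded[i:i + 2 * size + 1]) for i in range(len(word))]

class MemoryBasedClassifier:
    """1-nearest-neighbour with an overlap metric: store every training
    instance verbatim, then label a new instance with the label of the
    most similar stored one (no abstraction, no forgetting)."""

    def __init__(self):
        self.memory = []  # list of (feature tuple, label) pairs

    def fit(self, X, y):
        self.memory = list(zip(X, y))

    def predict(self, x):
        def overlap(a, b):
            # Count positions where the two feature tuples agree.
            return sum(f1 == f2 for f1, f2 in zip(a, b))
        return max(self.memory, key=lambda inst: overlap(inst[0], x))[1]

if __name__ == "__main__":
    # Hypothetical training pair: the unvocalized form "ktb" read as
    # "kataba", i.e. each consonant is followed by the short vowel "a".
    clf = MemoryBasedClassifier()
    X = char_windows("ktb")
    clf.fit(X, ["a", "a", "a"])
    print(clf.predict(char_windows("ktb")[1]))
```

Storing every instance rather than abstracting a model is deliberate: memory-based learning keeps exceptional cases available at prediction time, which matters for irregular vocalization patterns.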