This paper presents an efficient mechanism for converting the Sana’ani dialect to Modern Standard Arabic (MSA). The mechanism is based on morphological rules of both the Sana’ani dialect and MSA; these rules drive the conversion of dialect text to its MSA equivalent. The mechanism tokenizes the input dialect text and splits each token into a stem and its affixes. The affixes may be dialect affixes, MSA affixes, or both, and the stem may likewise be a dialect stem or an MSA stem. The mechanism, which is implemented using a simple MSA stemmer, must therefore handle each of these cases. Our dialect stemmer is then applied to strip the resulting token and extract the dialect affixes. At this point, the rules decide whether an affix should be extracted. The experiment shows that the Sana’ani dialect exhibits three classes of distortion: prefix, suffix, and stem distortions. The algorithm normalizes these distortions using the morphological rules. For each rule, the mechanism checks whether the rule can be applied: if the rule's conditions are met, the dialect affix is replaced by its MSA counterpart. If there is no restriction on applying a rule for a distorted stem, the rule can be regarded as a parallel corpus of the dialect and MSA. Finally, the experiment computes the distortion ratio of MSA in the Sana’ani dialect. In a Sana’ani dialect sample of 9386 words, 16.29% have distorted suffixes, 0.70% have distorted prefixes, and 2.17% contain distorted stems. These percentages relate only to the processed words.
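The rule-checking and distortion-ratio steps described above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the names `AffixRule`, `RULES`, `STEM_MAP`, `normalize_token`, and `distortion_ratios` are hypothetical, the Latin-letter affixes and stems are invented placeholders rather than real Sana’ani or MSA morphology, and the separate MSA-stemmer pass is omitted for brevity.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set, Tuple


@dataclass(frozen=True)
class AffixRule:
    kind: str                         # "prefix" or "suffix"
    dialect: str                      # dialect affix to look for
    msa: str                          # MSA replacement
    condition: Callable[[str], bool]  # checked on the stem left after stripping


# Placeholder rules (assumptions for illustration, NOT real morphology).
RULES: List[AffixRule] = [
    AffixRule("prefix", "zz", "aa", lambda stem: len(stem) >= 3),
    AffixRule("suffix", "qq", "bb", lambda stem: len(stem) >= 3),
]

# Unconditional dialect-stem to MSA-stem mapping (placeholder entries).
STEM_MAP: Dict[str, str] = {"foo": "bar"}


def normalize_token(token: str) -> Tuple[str, Set[str]]:
    """Return the normalized token and the distortion classes found in it."""
    found: Set[str] = set()
    prefix, suffix, stem = "", "", token

    for rule in RULES:
        if rule.kind == "prefix" and stem.startswith(rule.dialect):
            rest = stem[len(rule.dialect):]
            if rule.condition(rest):      # apply only if the conditions are met
                prefix, stem = rule.msa, rest
                found.add("prefix")
                break

    for rule in RULES:
        if rule.kind == "suffix" and stem.endswith(rule.dialect):
            rest = stem[:-len(rule.dialect)]
            if rule.condition(rest):
                suffix, stem = rule.msa, rest
                found.add("suffix")
                break

    if stem in STEM_MAP:                  # stem rule with no restriction
        stem = STEM_MAP[stem]
        found.add("stem")

    return prefix + stem + suffix, found


def distortion_ratios(text: str) -> Dict[str, float]:
    """Percentage of whitespace-separated tokens showing each distortion class."""
    tokens = text.split()
    counts = {"prefix": 0, "suffix": 0, "stem": 0}
    for token in tokens:
        for cls in normalize_token(token)[1]:
            counts[cls] += 1
    total = len(tokens) or 1              # avoid division by zero on empty input
    return {cls: 100.0 * n / total for cls, n in counts.items()}
```

In this sketch a token can count toward more than one distortion class, and every token contributes to the denominator; the paper's own percentages are stated to cover only the processed words, so its counting may differ.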