Fine-Grained POS Tagging of Spoken Tunisian Dialect Corpora

Arabic Dialects (AD) have recently begun to receive more attention from the speech science and technology communities. Using dialects in language technologies can improve the development process and the usability of applications such as speech recognition, speech comprehension, and speech synthesis. However, AD suffer from a lack of resources compared to Modern Standard Arabic (MSA). This paper addresses the problem of tagging one AD, the Tunisian Dialect (TD). We present a method for building a fine-grained part-of-speech (POS) tagger for TD. The method consists of adapting an MSA POS tagger by generating a TD training corpus from an MSA corpus using a bilingual MSA-TD lexicon. Evaluated on a corpus of text transcriptions, the TD tagger achieves an accuracy of 78.5%.
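The corpus-generation idea can be illustrated with a minimal sketch: each word of a tagged MSA corpus is replaced by its TD entry in the bilingual lexicon while its POS tag is kept, and a tagger is then trained on the projected corpus. Everything in the sketch is an assumption made for illustration. The tokens are placeholders rather than real MSA or TD words, the tag names are generic, the function names are hypothetical, and the most-frequent-tag model is only a stand-in for whatever tagger is actually adapted.

```python
from collections import Counter, defaultdict


def project_corpus(msa_corpus, lexicon):
    """Replace each MSA word with its TD lexicon entry, keeping the POS tag.

    Words missing from the lexicon are kept unchanged (an assumption).
    """
    return [
        [(lexicon.get(word, word), tag) for word, tag in sentence]
        for sentence in msa_corpus
    ]


def train_unigram_tagger(tagged_corpus):
    """Stand-in tagger: map each word to its most frequent tag."""
    counts = defaultdict(Counter)
    for sentence in tagged_corpus:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}


def tag(model, words, default="UNK"):
    """Tag a list of words; unseen words receive the default tag."""
    return [(word, model.get(word, default)) for word in words]


def accuracy(model, gold_corpus, default="UNK"):
    """Token-level accuracy of the model against a gold-tagged corpus."""
    total = correct = 0
    for sentence in gold_corpus:
        for word, gold_tag in sentence:
            total += 1
            correct += model.get(word, default) == gold_tag
    return correct / total if total else 0.0


# Placeholder data: not real MSA or TD words.
MSA_CORPUS = [
    [("msa_w1", "VERB"), ("msa_w2", "NOUN")],
    [("msa_w3", "PREP"), ("msa_w2", "NOUN")],
]
LEXICON = {"msa_w1": "td_w1", "msa_w2": "td_w2"}  # hypothetical MSA-TD entries

TD_CORPUS = project_corpus(MSA_CORPUS, LEXICON)
TD_MODEL = train_unigram_tagger(TD_CORPUS)
```

Calling `tag(TD_MODEL, ["td_w1", "td_w2"])` tags TD tokens with the labels inherited from the MSA corpus, and `accuracy` compares the model's output with a manually tagged TD sample at the token level.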