Mapping Rules for Building a Tunisian Dialect Lexicon and Generating Corpora

Nowadays in Tunisia, the Arabic Tunisian Dialect (TD) is increasingly used in interviews, news, and debate programs instead of Modern Standard Arabic (MSA). This has given rise to a new kind of language: most speech is no longer produced in MSA alone but alternates between MSA and TD. This situation has serious negative consequences for Automatic Speech Recognition (ASR): since spoken dialects are not officially written and have no standard orthography, it is very costly to obtain adequate annotated corpora for training language models and building vocabularies. There are neither parallel corpora involving TD and MSA nor dictionaries. In this paper, we describe a method for building a bilingual dictionary using explicit knowledge about the relation between TD and MSA. We also present an automatic process for creating Tunisian Dialect
