From Standard Arabic to Dialectal Arabic: Corpus Projection and Linguistic Resources for the Automatic Processing of Speech in Tunisian Media

ABSTRACT. This work addresses the automatic processing of speech in the Tunisian media. This speech is characterized by code-switching between Modern Standard Arabic (MSA) and the Tunisian dialect (TD). Our goal is to build resources for training language models for automatic speech recognition applications. Because TD is a variant of MSA, we describe a process for adapting MSA resources to TD. We present a first evaluation in terms of lexical coverage and perplexity.
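For readers unfamiliar with the two evaluation measures named in the abstract, the following is a minimal sketch of how lexical coverage and perplexity are commonly computed. It is not the authors' implementation: the function names, the add-one-smoothed unigram model, and the placeholder tokens are all assumptions made for illustration, and a real evaluation would typically use higher-order n-gram models trained with a dedicated toolkit.

```python
import math
from collections import Counter


def lexical_coverage(train_tokens, test_tokens):
    """Fraction of test tokens whose word form occurs in the training vocabulary."""
    vocab = set(train_tokens)
    covered = sum(1 for tok in test_tokens if tok in vocab)
    return covered / len(test_tokens)


def unigram_perplexity(train_tokens, test_tokens):
    """Perplexity of an add-one-smoothed unigram model (an illustrative
    assumption, not the model used in the paper) on the test tokens."""
    counts = Counter(train_tokens)
    vocab_size = len(counts) + 1  # one extra slot shared by all unseen words
    total = len(train_tokens)
    log_prob = 0.0
    for tok in test_tokens:
        p = (counts[tok] + 1) / (total + vocab_size)
        log_prob += math.log2(p)
    return 2 ** (-log_prob / len(test_tokens))


if __name__ == "__main__":
    # Placeholder tokens standing in for an adapted training corpus
    # and a transcribed test set; they are not real data.
    train = "a b a c".split()
    test = "a b d a".split()
    print("coverage:", lexical_coverage(train, test))
    print("perplexity:", unigram_perplexity(train, test))
```

Higher coverage means fewer out-of-vocabulary words in the test transcripts, and lower perplexity means the language model predicts the test text better. These are the two directions in which adapted resources would be expected to improve.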
