Collaboratively Constructed Linguistic Resources for Language Variants and their Exploitation in NLP Application - the case of Tunisian Arabic and the Social Media

Modern Standard Arabic (MSA) is the formal language in most Arab countries. Arabic dialects (AD), the languages of daily life, differ from MSA, especially in social media communication. Most Arabic social media texts, however, mix forms and show many variations, especially between MSA and AD. This paper aims to bridge the gap between MSA and AD by providing a framework for translating social media texts. More precisely, it focuses on the Tunisian Dialect of Arabic (TAD), with an application to the machine translation of social media text into MSA and any other target language. Linguistic resources, namely a bilingual TAD-MSA lexicon and a set of grammatical mapping rules, are collaboratively constructed and exploited together with a language model to produce MSA sentences from Tunisian dialectal sentences. This work is a first step towards collaboratively constructed semantic and lexical resources for Arabic social media within the ASMAT (Arabic Social Media Analysis Tools) project.
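The combination of a bilingual lexicon with a language model can be illustrated with a minimal sketch. It is not the authors' implementation: the lexicon entries, the bigram counts, and the function names are all hypothetical placeholders, the grammatical mapping rules are left out, and a toy add-one bigram score stands in for a real MSA language model. The idea shown is only that each TAD token proposes one or more MSA candidates and the language model picks the most plausible combination.

```python
import math
from itertools import product

# Hypothetical toy lexicon: each TAD token maps to one or more MSA candidates.
# The token names are placeholders, not real dialectal or MSA words.
LEXICON = {
    "tad_a": ["msa_a1", "msa_a2"],
    "tad_b": ["msa_b"],
}

# Hypothetical bigram counts standing in for an MSA language model.
BIGRAM_COUNTS = {
    ("<s>", "msa_a1"): 1,
    ("<s>", "msa_a2"): 3,
    ("msa_a1", "msa_b"): 1,
    ("msa_a2", "msa_b"): 4,
    ("msa_b", "</s>"): 5,
}


def candidates(tokens):
    """Look up MSA candidates per token; unknown tokens pass through unchanged."""
    return [LEXICON.get(token, [token]) for token in tokens]


def score(sentence):
    """Unnormalized log score: sum of log(count + 1) over the sentence bigrams."""
    padded = ["<s>", *sentence, "</s>"]
    return sum(
        math.log(BIGRAM_COUNTS.get(bigram, 0) + 1)
        for bigram in zip(padded, padded[1:])
    )


def translate(tokens):
    """Return the candidate MSA sentence with the highest language-model score."""
    return list(max(product(*candidates(tokens)), key=score))


print(translate(["tad_a", "tad_b"]))
```

Enumerating every combination is exponential in sentence length, so this only suits short inputs; a realistic system would use beam search or lattice decoding with a trained n-gram model instead.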
