A Spelling Correction Corpus for Multiple Arabic Dialects

Arabic dialects are the non-standard varieties of Arabic commonly spoken – and increasingly written on social media – across the Arab world. They have no standard orthographies, which poses a challenge for natural language processing applications. In this paper, we present the MADAR CODA Corpus, a collection of 10,000 sentences from five Arabic city dialects (Beirut, Cairo, Doha, Rabat, and Tunis), each represented in the Conventional Orthography for Dialectal Arabic (CODA) in parallel with its raw original form. The sentences come from the Multi-Arabic Dialect Applications and Resources (MADAR) Project and are parallel across the cities (2,000 sentences from each city). This publicly available resource is intended to support research on spelling correction and text normalization for Arabic dialects. We present results on a bootstrapping technique that we use to speed up the CODA annotation, as well as on the degree of similarity across the dialects before and after CODA annotation.
