Creating Parallel Arabic Dialect Corpus: Pitfalls to Avoid

Creating parallel corpora is a difficult task that many researchers try to address. For under-resourced languages such as Arabic dialects, the task is even more complicated because of the nature of these spoken languages. In this paper, we share our experience of creating a parallel corpus that contains several dialects and Modern Standard Arabic (MSA). We highlight the most important choices we made and assess how well they worked.
