Arabic Dialect Identification Using a Parallel Multidialectal Corpus

We present a study on sentence-level Arabic Dialect Identification using the newly developed Multidialectal Parallel Corpus of Arabic (MPCA); these are the first experiments on such data. Using a set of surface features based on characters and words, we conduct three experiments with a linear Support Vector Machine classifier and a meta-classifier using stacked generalization, a method not previously applied to this task. The first experiment is a 6-way multi-dialect classification task, which achieves \(74\,\%\) accuracy against a random baseline of \(16.7\,\%\) and demonstrates that meta-classifiers can yield large performance increases over single classifiers. The second experiment investigates pairwise binary dialect classification within the corpus, yielding results as high as \(94\,\%\), but also highlighting poorer results between closely related dialects such as Palestinian and Jordanian (\(76\,\%\)). Our final experiment conducts cross-corpus evaluation on the widely used Arabic Online Commentary (AOC) dataset and demonstrates that, despite differing greatly in size and content, models trained with the MPCA generalize to the AOC, and vice versa. Using only 2,000 sentences from the MPCA, we classify over 26k sentences from the radically different AOC dataset with \(74\,\%\) accuracy. We also use this data to classify a new dataset of MSA and Egyptian Arabic tweets with \(97\,\%\) accuracy. We find that character n-grams are a very informative feature for this task, in both within- and cross-corpus settings. Contrary to previous results, they outperform word n-grams in several experiments here. Several directions for future work are outlined.
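As a rough illustration of the surface features mentioned above, the following minimal sketch extracts character and word n-gram counts from a sentence and computes the random baseline for a balanced k-way task. The function names, the n-gram orders, and the whitespace tokenization are assumptions made for this example; the abstract does not specify the exact feature configuration or preprocessing used in the experiments.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return all overlapping character n-grams of `text` (spaces included)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_ngrams(text, n):
    """Return all overlapping word n-grams, using whitespace tokenization
    (an assumption; the paper's tokenizer is not stated in the abstract)."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def surface_features(text, char_orders=(1, 2, 3), word_orders=(1, 2)):
    """Count character and word n-grams for one sentence.

    The default n-gram orders are illustrative, not taken from the paper.
    Keys are (feature type, order, n-gram) tuples so the two feature
    families never collide.
    """
    feats = Counter()
    for n in char_orders:
        for gram in char_ngrams(text, n):
            feats[("char", n, gram)] += 1
    for n in word_orders:
        for gram in word_ngrams(text, n):
            feats[("word", n, gram)] += 1
    return feats


def random_baseline(num_classes):
    """Expected accuracy of uniform random guessing over balanced classes."""
    return 1.0 / num_classes


print(char_ngrams("abcd", 2))              # ['ab', 'bc', 'cd']
print(round(100 * random_baseline(6), 1))  # 16.7
```

Feature counts of this kind would typically be converted to sparse vectors and passed to a linear classifier; a stacked meta-classifier would then be trained on the outputs of several such base classifiers, one per feature type.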
