Deep Models for Arabic Dialect Identification on Benchmarked Data

The Arabic Online Commentary (AOC) (Zaidan and Callison-Burch, 2011) is a large-scale repository of Arabic dialects with manual labels for 4 varieties of the language. Existing dialect identification models exploiting the dataset pre-date the recent boost deep learning brought to NLP, and hence the data are not benchmarked for use with deep learning, nor is it clear how much neural networks can help tease the categories in the data apart. We address these two limitations: we (1) benchmark the data, and (2) empirically test 6 different deep learning methods on the task, comparing performance to several classical machine learning models under different conditions (i.e., both binary and multi-way classification). Our experimental results show that variants of (attention-based) bidirectional recurrent neural networks achieve the best accuracy (acc) on the task, significantly outperforming all competitive baselines. On blind test data, our models reach 87.65% acc on the binary task (MSA vs. dialects), 87.4% acc on the 3-way dialect task (Egyptian vs. Gulf vs. Levantine), and 82.45% acc on the 4-way variants task (MSA vs. Egyptian vs. Gulf vs. Levantine). We release our benchmark for future work on the dataset.
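The three evaluation conditions named in the abstract (binary, 3-way, 4-way) can all be derived from a single 4-way labeling. The following is a minimal sketch of that relabeling, not the authors' released benchmark code: the label strings, the task names, and the helpers `to_task_label` and `build_task` are hypothetical and chosen only for illustration.

```python
from typing import List, Optional, Tuple

# Hypothetical label strings for the four varieties named in the abstract.
MSA, EGY, GLF, LEV = "MSA", "Egyptian", "Gulf", "Levantine"
DIALECTS = {EGY, GLF, LEV}


def to_task_label(label: str, task: str) -> Optional[str]:
    """Map a 4-way label to the label used by `task`.

    Returns None when the example is excluded from that task
    (MSA examples are not part of the 3-way dialect task).
    """
    if label != MSA and label not in DIALECTS:
        raise ValueError(f"unknown label: {label!r}")
    if task == "binary":   # MSA vs. dialects
        return MSA if label == MSA else "dialect"
    if task == "3way":     # Egyptian vs. Gulf vs. Levantine
        return label if label in DIALECTS else None
    if task == "4way":     # MSA vs. Egyptian vs. Gulf vs. Levantine
        return label
    raise ValueError(f"unknown task: {task!r}")


def build_task(examples: List[Tuple[str, str]], task: str) -> List[Tuple[str, str]]:
    """Relabel (text, label) pairs for `task`, dropping excluded examples."""
    relabeled = []
    for text, label in examples:
        mapped = to_task_label(label, task)
        if mapped is not None:
            relabeled.append((text, mapped))
    return relabeled
```

Under these assumptions, calling `build_task(data, "3way")` on 4-way labeled pairs keeps only the dialectal examples, while `"binary"` collapses the three dialects into a single class.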

[1] Yoshua Bengio, et al. Attention-Based Models for Speech Recognition, 2015, NIPS.

[2] Mona T. Diab, et al. Sentence Level Dialect Identification in Arabic, 2013, ACL.

[3] Muhammad Abdul-Mageed, et al. Subjectivity and sentiment analysis of Arabic as a morphologically-rich language, 2015.

[4] Yaser Al-Onaizan, et al. Improved Sentence-Level Arabic Dialect Classification, 2014, VarDial@COLING.

[5] Jeffrey Nichols, et al. Home Location Identification of Twitter Users, 2014, TIST.

[6] Wei Shi, et al. Attention-Based Bidirectional Long Short-Term Memory Networks for Relation Classification, 2016, ACL.

[7] Yoshua Bengio, et al. Neural Machine Translation by Jointly Learning to Align and Translate, 2014, ICLR.

[8] Amanda Lee Hughes, et al. Social Media in Disaster Communication, 2018.

[9] James R. Glass, et al. Exploiting Convolutional Neural Networks for Phonotactic Based Dialect Identification, 2018, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[10] James R. Glass, et al. Automatic Dialect Detection in Arabic Broadcast Speech, 2015, INTERSPEECH.

[11] Ryan Cotterell, et al. A Multi-Dialect, Multi-Genre Corpus of Informal Written Arabic, 2014, LREC.

[12] Muhammad Abdul-Mageed, et al. You Tweet What You Speak: A City-Level Dataset of Arabic Dialects, 2018, LREC.

[13] Yonatan Belinkov, et al. A Character-level Convolutional Neural Network for Distinguishing Similar Languages and Dialects, 2016, VarDial@COLING.

[14] James R. Glass, et al. Convolutional Neural Networks and Language Embeddings for End-to-End Dialect Recognition, 2018, Odyssey.

[15] Gaël Varoquaux, et al. Scikit-learn: Machine Learning in Python, 2011, J. Mach. Learn. Res.

[16] Mona T. Diab, et al. AIDA: Identifying Code Switching in Informal Arabic Text, 2014, CodeSwitch@EMNLP.

[17] Muhammad Abdul-Mageed, et al. Modeling Arabic subjectivity and sentiment in lexical space, 2017, Inf. Process. Manag.

[18] Timothy Baldwin, et al. Automatic Language Identification in Texts: A Survey, 2018, J. Artif. Intell. Res.

[19] Jeffrey Dean, et al. Efficient Estimation of Word Representations in Vector Space, 2013, ICLR.

[20] Antonio Jimeno-Yepes, et al. Investigating Public Health Surveillance using Twitter, 2015, BioNLP@IJCNLP.

[21] Mona T. Diab, et al. COLABA: Arabic Dialect Annotation and Processing, 2011.

[22] Yoon Kim, et al. Convolutional Neural Networks for Sentence Classification, 2014, EMNLP.

[23] Jiajun Liu, et al. Understanding Human Mobility from Twitter, 2014, PloS one.

[24] Jürgen Schmidhuber, et al. Long Short-Term Memory, 1997, Neural Computation.

[25] Muhammad Abdul-Mageed, et al. Recognizing Pathogenic Empathy in Social Media, 2017, ICWSM.

[26] Wang Ling, et al. Microblogs as Parallel Corpora, 2013, ACL.

[27] Chris Callison-Burch, et al. Arabic Dialect Identification, 2014, CL.

[28] Zhi Jin, et al. Classifying Relations via Long Short Term Memory Networks along Shortest Dependency Paths, 2015, EMNLP.

[29] Chris Callison-Burch, et al. The Arabic Online Commentary Dataset: an Annotated Dataset of Informal Arabic with High Dialectal Content, 2011, ACL.

[30] Yuan Yu, et al. TensorFlow: A system for large-scale machine learning, 2016, OSDI.

[31] Ming Wen, et al. Building a National Neighborhood Dataset From Geotagged Twitter Data for Indicators of Happiness, Diet, and Physical Activity, 2016, JMIR public health and surveillance.

[32] Timothy Baldwin, et al. Semi-supervised User Geolocation via Graph Convolutional Networks, 2018, ACL.

[33] Fei Huang. Improved Arabic Dialect Classification with Social Media Data, 2015, EMNLP.

[34] M. Barthelemy, et al. From mobile phone data to the spatial structure of cities, 2014, Scientific Reports.

[35] Carlo Ratti, et al. Geo-located Twitter as proxy for global mobility patterns, 2013, Cartography and geographic information science.

[36] Yulia Tsvetkov, et al. Incorporating Dialectal Variability for Socially Equitable Language Identification, 2017, ACL.

[37] James R. Glass, et al. MIT-QCRI Arabic dialect identification system for the 2017 multi-genre broadcast challenge, 2017, 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).

[38] Timothy Baldwin, et al. langid.py: An Off-the-shelf Language Identification Tool, 2012, ACL.

[39] Ming Wen, et al. Twitter-derived neighborhood characteristics associated with obesity and diabetes, 2017, Scientific Reports.

[40] Krister Lindén, et al. Evaluation of language identification methods using 285 languages, 2017, NODALIDA.

[41] Steven Bird, et al. The Human Language Project: Building a Universal Corpus of the World's Languages, 2010, ACL.

[42] Nizar Habash, et al. Conventional Orthography for Dialectal Arabic, 2012, LREC.

[43] Geoffrey E. Hinton, et al. Rectified Linear Units Improve Restricted Boltzmann Machines, 2010, ICML.

[44] Ondrej Bojar, et al. LanideNN: Multilingual Language Identification on Character Window, 2017, EACL.

[45] Nizar Habash, et al. Introduction to Arabic Natural Language Processing, 2010.

[46] Yoshua Bengio, et al. Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling, 2014, ArXiv.

[47] Timothy Baldwin, et al. A Neural Model for User Geolocation and Lexical Dialectology, 2017, ACL.

[48] Hassan Sajjad, et al. Verifiably Effective Arabic Dialect Identification, 2014, EMNLP.

[49] Karen Spärck Jones. A statistical interpretation of term specificity and its application in retrieval, 2021, J. Documentation.

[50] Lukasz Kaiser, et al. Attention is All you Need, 2017, NIPS.

[51] Yutaka Matsuo, et al. Earthquake shakes Twitter users: real-time event detection by social sensors, 2010, WWW '10.

[52] Zhiyuan Liu, et al. A C-LSTM Neural Network for Text Classification, 2015, ArXiv.