UnibucKernel Reloaded: First Place in Arabic Dialect Identification for the Second Year in a Row

We present a machine learning approach that ranked first in the Arabic Dialect Identification (ADI) Closed Shared Task of the 2018 VarDial Evaluation Campaign. The proposed approach combines several kernels using multiple kernel learning. While most of our kernels are based on character p-grams (also known as n-grams) extracted from speech transcripts or phonetic transcripts, we also use a kernel based on dialectal embeddings that the organizers generated from audio recordings. In the learning stage, we independently employ Kernel Discriminant Analysis (KDA) and Kernel Ridge Regression (KRR). Preliminary experiments indicate that KRR provides better classification results. Our approach is shallow and simple, yet the empirical results show that it achieved the best performance in the 2018 ADI Closed Shared Task. Furthermore, our top macro-F1 score (58.92%) is significantly better than the second best score (57.59%) in the 2018 ADI Shared Task, according to the statistical significance test performed by the organizers. We obtain even better post-competition results (a macro-F1 score of 62.28%) using the audio embeddings released by the organizers after the competition. With a very similar approach (that did not include phonetic features), we also ranked first in the ADI Closed Shared Task of the 2017 VarDial Evaluation Campaign, surpassing the second best method by 4.62%. We therefore conclude that our multiple kernel learning method is the best approach to date for Arabic dialect identification.
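The pipeline outlined in the abstract (character p-gram kernels, an embedding kernel, kernel combination, and KRR) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the presence-style p-gram kernel, the p-gram range (2 to 3), the linear kernel on embeddings, the unweighted sum as the combination rule, the one-versus-all label encoding, the regularization value, and the toy strings and vectors. All function names are hypothetical.

```python
import numpy as np

def pgram_set(text, p_min=2, p_max=3):
    """Set of all character p-grams of `text` for p in [p_min, p_max]."""
    return {text[i:i + p]
            for p in range(p_min, p_max + 1)
            for i in range(len(text) - p + 1)}

def presence_kernel(docs_a, docs_b):
    """Assumed string kernel: number of distinct p-grams two texts share."""
    sets_a = [pgram_set(d) for d in docs_a]
    sets_b = [pgram_set(d) for d in docs_b]
    return np.array([[len(a & b) for b in sets_b] for a in sets_a], dtype=float)

def linear_kernel(X_a, X_b):
    """Dot-product kernel on fixed-size embedding vectors."""
    return np.asarray(X_a, dtype=float) @ np.asarray(X_b, dtype=float).T

def normalized(kernel, a, b):
    """K(a, b) / sqrt(K(a, a) * K(b, b)): unit self-similarity per sample.
    Assumes no sample has zero self-similarity (e.g. no empty text)."""
    K = kernel(a, b)
    self_a = np.array([kernel([x], [x])[0, 0] for x in a])
    self_b = np.array([kernel([x], [x])[0, 0] for x in b])
    return K / np.sqrt(np.outer(self_a, self_b))

def krr_fit(K, labels, n_classes, lam=1e-3):
    """Dual KRR with one-versus-all targets in {-1, +1}:
    solves (K + lam * I) alpha = Y."""
    Y = -np.ones((len(labels), n_classes))
    Y[np.arange(len(labels)), labels] = 1.0
    return np.linalg.solve(K + lam * np.eye(len(labels)), Y)

def krr_predict(K_test, alpha):
    """Pick the class with the highest regression score."""
    return np.argmax(K_test @ alpha, axis=1)

# Toy data (invented): two "dialects" with distinct character patterns.
train_text = ["aaab aaba", "aab aaab", "zzzy zzyz", "zzy zzzy"]
train_emb = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
train_y = np.array([0, 0, 1, 1])
test_text = ["aaba aab", "zzyz zzy"]
test_emb = [[0.8, 0.2], [0.2, 0.8]]

# Combine kernels by summing the normalized kernel matrices (assumed rule).
K_train = (normalized(presence_kernel, train_text, train_text)
           + normalized(linear_kernel, train_emb, train_emb))
K_test = (normalized(presence_kernel, test_text, train_text)
          + normalized(linear_kernel, test_emb, train_emb))

alpha = krr_fit(K_train, train_y, n_classes=2)
pred = krr_predict(K_test, alpha)
print(pred)
```

Summing normalized kernel matrices corresponds to concatenating the normalized feature spaces, which keeps the learner unchanged: KRR only ever sees a single precomputed kernel matrix, so adding another information source amounts to adding one more term to the sum.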
