ASIREM Participation at the Discriminating Similar Languages Shared Task 2016

This paper presents the system built by the ASIREM team for the Discriminating between Similar Languages (DSL) Shared Task 2016. The system uses character-based and word-based n-grams separately. ASIREM participated in both sub-tasks (sub-task 1 and sub-task 2), in both the open and closed tracks. In sub-task 1, which deals with discriminating between similar languages and national language varieties, the system achieved an accuracy of 87.79% on the closed track, ranking ninth (the best result being 89.38%). In sub-task 2, which deals with Arabic dialect identification, the system achieved its best performance using character-based n-grams (49.67% accuracy), ranking fourth in the closed track (the best result being 51.16%), and an accuracy of 53.18%, ranking first in the open track.
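To make the two feature types concrete, the following minimal sketch shows how character-based and word-based n-grams can be counted from a sentence. It is only an illustration under stated assumptions: the function names `char_ngrams` and `word_ngrams`, the whitespace tokenization, and the choice of n are hypothetical and are not taken from the ASIREM system, whose classifier and n-gram orders are not specified in this abstract.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count overlapping character n-grams of length n (spaces included)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def word_ngrams(text, n):
    """Count overlapping word n-grams, assuming whitespace tokenization."""
    tokens = text.split()
    return Counter(
        " ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )


if __name__ == "__main__":
    sentence = "a b a b"  # toy input, not shared-task data
    print(char_ngrams(sentence, 3))
    print(word_ngrams(sentence, 2))
```

Counts of this kind would typically be fed, separately for each feature type, into a text classifier; which classifier the authors used is not stated here.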
