When Sparse Traditional Models Outperform Dense Neural Networks: the Curious Case of Discriminating between Similar Languages

We present the results of our participation in the VarDial 4 shared task on discriminating closely related languages. Our submission includes simple traditional models based on linear support vector machines (SVMs) and a neural network (NN). The main idea was to leverage language group information: we did so with a two-layer approach in the traditional model and with a multi-task objective in the neural network. Our results confirm earlier findings: simple traditional models consistently outperform neural networks on this task, at least given the number of systems we could examine in the available time. Our two-layer linear SVM ranked 2nd in the shared task.
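The two-layer idea can be illustrated with a minimal sketch: a first linear SVM predicts the language group, and a second, group-specific linear SVM then predicts the language within that group. The sketch below is not the submitted system. It assumes scikit-learn, character n-gram TF-IDF features, default hyperparameters, and a tiny toy dataset; the names `TwoLayerClassifier`, `make_clf`, `LANG_TO_GROUP`, `TEXTS`, and `LANGS` are hypothetical and chosen only for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC


def make_clf():
    # Assumed feature set: character 1-3 gram TF-IDF fed into a linear SVM.
    return make_pipeline(
        TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 3)),
        LinearSVC(),
    )


class TwoLayerClassifier:
    """Layer 1 predicts the language group; layer 2 predicts the language within it."""

    def __init__(self, lang_to_group):
        self.lang_to_group = lang_to_group
        self.group_clf = make_clf()
        self.lang_clfs = {}  # group name -> fitted pipeline, or a constant label

    def fit(self, texts, langs):
        groups = [self.lang_to_group[lang] for lang in langs]
        self.group_clf.fit(texts, groups)
        for group in set(groups):
            idx = [i for i, g in enumerate(groups) if g == group]
            group_langs = [langs[i] for i in idx]
            if len(set(group_langs)) == 1:
                # A group with a single language needs no second classifier.
                self.lang_clfs[group] = group_langs[0]
            else:
                group_texts = [texts[i] for i in idx]
                self.lang_clfs[group] = make_clf().fit(group_texts, group_langs)
        return self

    def predict(self, texts):
        predictions = []
        for text, group in zip(texts, self.group_clf.predict(texts)):
            clf = self.lang_clfs[group]
            if isinstance(clf, str):
                predictions.append(clf)
            else:
                predictions.append(str(clf.predict([text])[0]))
        return predictions


# Toy data for illustration only; not drawn from the shared task corpus.
LANG_TO_GROUP = {"hr": "south-slavic", "sr": "south-slavic",
                 "pt-BR": "portuguese", "pt-PT": "portuguese"}
TEXTS = ["kruh i mlijeko", "lijepo vrijeme danas",
         "hleb i mleko", "lepo vreme danas",
         "o ônibus chegou", "o trem chegou cedo",
         "o autocarro chegou", "o comboio chegou cedo"]
LANGS = ["hr", "hr", "sr", "sr", "pt-BR", "pt-BR", "pt-PT", "pt-PT"]

model = TwoLayerClassifier(LANG_TO_GROUP).fit(TEXTS, LANGS)
print(model.predict(["o autocarro chegou cedo", "hleb i mleko"]))
```

One consequence of this design is that a predicted language always belongs to the predicted group, so an error in the first layer cannot be corrected by the second.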
