Vanilla Classifiers for Distinguishing between Similar Languages

In this paper we describe the UniBuc-NLP team's submission to the Discriminating between Similar Languages (DSL) 2016 shared task. We present and analyze the results we obtained in the closed track of sub-task 1 (similar languages and language varieties) and sub-task 2 (Arabic dialects). For sub-task 1 we used a logistic regression classifier with tf-idf feature weighting, and for sub-task 2 an SVM classifier with a character-based string kernel. Our results show that good accuracy scores can be obtained with limited feature and model engineering. Although the approach has limitations, it worked surprisingly well on out-of-domain social media data, reaching 0.898 accuracy (3rd place) on dataset B1 and 0.838 accuracy (4th place) on dataset B2.
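
To make the idea of a character-based string kernel concrete, the following is a minimal sketch of one common variant, a normalized character n-gram (spectrum) kernel. The abstract does not say which string kernel or n-gram range was used in the submission, so the kernel choice, the function names, and the default range of 2 to 4 characters are all assumptions made for illustration only.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n_min=2, n_max=4):
    """Count every character n-gram of length n_min..n_max in text.

    The n-gram range is an illustrative assumption, not a value
    taken from the paper.
    """
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def spectrum_kernel(a, b, n_min=2, n_max=4):
    """Unnormalized spectrum kernel: dot product of n-gram count vectors."""
    ca = char_ngrams(a, n_min, n_max)
    cb = char_ngrams(b, n_min, n_max)
    # Iterate over the smaller profile; a Counter returns 0 for missing keys.
    if len(ca) > len(cb):
        ca, cb = cb, ca
    return sum(count * cb[gram] for gram, count in ca.items())


def normalized_kernel(a, b, n_min=2, n_max=4):
    """Cosine-style normalization so that k(x, x) = 1 for non-empty profiles."""
    kaa = spectrum_kernel(a, a, n_min, n_max)
    kbb = spectrum_kernel(b, b, n_min, n_max)
    if kaa == 0 or kbb == 0:
        return 0.0  # at least one text is too short to contain any n-gram
    return spectrum_kernel(a, b, n_min, n_max) / sqrt(kaa * kbb)


def gram_matrix(texts, n_min=2, n_max=4):
    """Pairwise kernel matrix over a list of texts (list of lists of floats)."""
    return [[normalized_kernel(s, t, n_min, n_max) for t in texts] for s in texts]
```

A matrix produced by `gram_matrix` could then be passed to any SVM implementation that accepts a precomputed kernel, for example scikit-learn's `SVC(kernel="precomputed")`. Whether the submission was trained this way is not stated in the abstract.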