Language Discrimination and Transfer Learning for Similar Languages: Experiments with Feature Combinations and Adaptation

This paper describes the work of team tearsofjoy in the VarDial 2019 Evaluation Campaign. We developed two systems based on Support Vector Machines (SVMs): an SVM with a flat combination of features, and SVM ensembles. We participated in all language/dialect identification tasks, as well as the Moldavian vs. Romanian cross-dialect topic identification (MRC) task. Our team placed first in German Dialect Identification (GDI) and in MRC subtasks 2 and 3; second in the simplified variant of Discriminating between Mainland and Taiwan variation of Mandarin Chinese (DMT) and in Cuneiform Language Identification (CLI); third in the traditional variant of DMT; and fifth in MRC subtask 1. In most cases, the SVM with a flat combination of features performed better than the SVM ensembles. Besides describing the systems and their results, we provide a tentative comparison of the feature combination methods and present additional experiments with a method of adaptation to the test set, which may indicate potential pitfalls in some of the data sets.
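The difference between the two systems named above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes scikit-learn, and the feature types (character and word n-grams with TF-IDF weighting), the n-gram ranges, the majority-vote rule, and the toy data are all hypothetical choices made only for illustration. The flat system trains one SVM on the concatenation of all feature types, while the ensemble trains one SVM per feature type and combines their predictions.

```python
from collections import Counter

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, make_pipeline
from sklearn.svm import LinearSVC

# Toy data (hypothetical): two labels standing in for two language varieties.
train_texts = [
    "the cat sat on the mat", "a dog ran in the park",
    "the bird sang in the tree", "she read the book at home",
    "die katze sitzt auf der matte", "ein hund lief im park",
    "der vogel sang im baum", "sie las das buch zu hause",
]
train_labels = ["en", "en", "en", "en", "de", "de", "de", "de"]
test_texts = ["the dog sat at home", "der hund sitzt im baum"]


def feature_views():
    """Return fresh, unfitted vectorizers, one per (assumed) feature type."""
    return [
        ("char", TfidfVectorizer(analyzer="char", ngram_range=(1, 4))),
        ("word1", TfidfVectorizer(analyzer="word", ngram_range=(1, 1))),
        ("word2", TfidfVectorizer(analyzer="word", ngram_range=(1, 2))),
    ]


# System 1: flat combination. All feature types are concatenated into a
# single sparse vector and one linear SVM is trained on it.
flat_svm = make_pipeline(FeatureUnion(feature_views()), LinearSVC())
flat_svm.fit(train_texts, train_labels)
flat_pred = [str(label) for label in flat_svm.predict(test_texts)]

# System 2: ensemble. One linear SVM per feature type; predictions are
# combined by majority vote (one possible fusion rule among several).
base_models = [make_pipeline(vec, LinearSVC()) for _, vec in feature_views()]
for model in base_models:
    model.fit(train_texts, train_labels)
votes = [model.predict(test_texts) for model in base_models]
ensemble_pred = [
    str(Counter(column).most_common(1)[0][0]) for column in zip(*votes)
]

print("flat:    ", flat_pred)
print("ensemble:", ensemble_pred)
```

In the flat system the SVM weighs all feature types jointly, whereas in the ensemble each base classifier sees only its own feature type and the interaction happens only at the voting stage.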
