Computationally efficient discrimination between language varieties with large feature vectors and regularized classifiers

This contribution presents efficient approaches to language classification that were field-tested in the VarDial evaluation campaign. It describes the methods used in several language identification tasks covering different language types and discusses their results, giving insights into the real-world application of regularization, linear classifiers and corresponding linguistic features. A specially adapted Ridge classifier proved useful in two of the three tasks. The overall approach (XAC) slightly outperformed most of the other systems on the DFS task (Dutch and Flemish) and on the ILI task (Indo-Aryan languages), while its comparative performance was poorer on the GDI task (Swiss German dialects).
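To make the combination of large sparse feature vectors and a regularized linear classifier concrete, the following is a minimal sketch and not the XAC system itself. It assumes space-padded, lowercased character n-grams of length 1 to 3, unit-length count vectors, and a one-vs-rest ridge regression without intercept, solved in dual form. All function names, the penalty `alpha=1.0` and the toy strings are illustrative assumptions; the abstract does not specify the actual features or the adaptation of the Ridge classifier.

```python
from collections import Counter
import math


def char_ngrams(text, n_min=1, n_max=3):
    """Count character n-grams of length n_min..n_max in a space-padded, lowercased text."""
    padded = f" {text.lower()} "
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    return counts


def normalize(counts):
    """Scale a sparse count vector to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in counts.values())) or 1.0
    return {k: v / norm for k, v in counts.items()}


def dot(u, v):
    """Sparse dot product of two feature dictionaries."""
    if len(u) > len(v):
        u, v = v, u
    return sum(val * v.get(key, 0.0) for key, val in u.items())


def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_ridge(texts, labels, alpha=1.0):
    """One-vs-rest ridge regression on +1/-1 targets, solved in dual form.

    Solves (X X^T + alpha * I) a = y per class, which is equivalent to the
    primal ridge solution w = X^T a when no intercept is fitted.
    """
    feats = [normalize(char_ngrams(t)) for t in texts]
    gram = [[dot(u, v) for v in feats] for u in feats]
    for i in range(len(gram)):
        gram[i][i] += alpha  # the L2 penalty appears on the diagonal
    duals = {}
    for cls in sorted(set(labels)):
        targets = [1.0 if y == cls else -1.0 for y in labels]
        duals[cls] = solve(gram, targets)
    return feats, duals


def predict(model, text):
    """Return the class whose ridge score is highest for the given text."""
    feats, duals = model
    x = normalize(char_ngrams(text))
    sims = [dot(x, f) for f in feats]
    scores = {cls: sum(a * s for a, s in zip(coef, sims))
              for cls, coef in duals.items()}
    return max(scores, key=scores.get)


# Toy, invented training strings (not taken from any VarDial data set).
model = fit_ridge(["das ist ein haus", "this is a house"], ["de", "en"])
print(predict(model, "ein kleines haus"))
```

The dual form keeps the linear system as small as the number of training texts, however many distinct n-grams occur. For larger training sets, a sparse primal solver such as the one behind scikit-learn's `RidgeClassifier` would be the more practical choice.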
