Paper information - Computationally efficient discrimination between language varieties with large feature vectors and regularized classifiers


The present contribution revolves around efficient approaches to language classification which have been field-tested in the VarDial evaluation campaign. The methods used in several language identification tasks comprising different language types are presented and their results are discussed, giving insights into the real-world application of regularization, linear classifiers and corresponding linguistic features. The use of a specially adapted Ridge classifier proved useful in two of the three tasks. The overall approach (XAC) slightly outperformed most of the other systems on the DFS task (Dutch and Flemish) and on the ILI task (Indo-Aryan languages), while its comparative performance was poorer on the GDI task (Swiss German dialects).
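As a rough illustration of the kind of pipeline the abstract describes (a large sparse feature vector fed to a regularized linear classifier), the following is a minimal sketch using scikit-learn. It is not the XAC system: the character n-gram range, the `alpha` value, the helper name `build_classifier`, and the toy English sentences and labels are all assumptions made for the example, and the "specially adapted" Ridge settings used in the paper are not reproduced here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import RidgeClassifier
from sklearn.pipeline import make_pipeline

# Toy, made-up sentences; NOT VarDial data. Labels are illustrative only.
train_texts = [
    "the colour of the theatre programme",
    "my neighbour organised the travelling",
    "the color of the theater program",
    "my neighbor organized the traveling",
]
train_labels = ["en-GB", "en-GB", "en-US", "en-US"]


def build_classifier(alpha=1.0):
    """Hypothetical helper: character n-gram TF-IDF features + Ridge classifier.

    `alpha` is the L2 regularization strength; 1.0 is an assumed default,
    not a value taken from the paper.
    """
    return make_pipeline(
        # Character n-grams within word boundaries give a large, sparse vector.
        TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 4), sublinear_tf=True),
        # Linear classifier with L2 (ridge) regularization.
        RidgeClassifier(alpha=alpha),
    )


model = build_classifier()
model.fit(train_texts, train_labels)

# Predict the variety of two unseen toy sentences.
predictions = model.predict(["a colourful neighbourhood", "a colorful neighborhood"])
print(list(predictions))
```

Increasing `alpha` shrinks the learned weights more strongly, which is the usual way to keep a linear model stable when the number of n-gram features far exceeds the number of training instances.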

Adrien Barbaresi | A. Barbaresi
