Comparing the Performance of CNNs and Shallow Models for Language Identification

In this work, we compare the performance of convolutional neural networks (CNNs) and shallow models on three of the four language identification shared tasks proposed in the VarDial Evaluation Campaign 2021. In our experiments, CNNs and shallow models yielded comparable performance in the Romanian Dialect Identification (RDI) and Dravidian Language Identification (DLI) shared tasks once the training data was augmented, while an ensemble of support vector machines and Naïve Bayes models was the best-performing model in the Uralic Language Identification (ULI) task. Although the deep learning models did not achieve state-of-the-art performance on these tasks and tended to overfit the data, the ensemble method was one of two methods that beat the existing baseline for the first track of the ULI shared task.
