A Report on the VarDial Evaluation Campaign 2020

This paper presents the results of the VarDial Evaluation Campaign 2020, organized as part of the seventh Workshop on Natural Language Processing (NLP) for Similar Languages, Varieties and Dialects (VarDial), co-located with COLING 2020. The campaign included three shared tasks, each focusing on a different challenge of language and dialect identification: Romanian Dialect Identification (RDI), Social Media Variety Geolocation (SMG), and Uralic Language Identification (ULI). The campaign attracted 30 teams that enrolled to participate in one or more shared tasks, and 14 of them submitted runs across the three shared tasks. Finally, 11 papers describing participating systems are published in the VarDial proceedings and referred to in this report.
