Findings of the VarDial Evaluation Campaign 2021

This paper describes the results of the shared tasks organized as part of the VarDial Evaluation Campaign 2021. The campaign was part of the eighth workshop on Natural Language Processing (NLP) for Similar Languages, Varieties and Dialects (VarDial), co-located with EACL 2021. Four separate shared tasks were included this year: Dravidian Language Identification (DLI), Romanian Dialect Identification (RDI), Social Media Variety Geolocation (SMG), and Uralic Language Identification (ULI). DLI was organized for the first time, while the other three continued series of tasks from previous evaluation campaigns.
