Exploring the Power of Romanian BERT for Dialect Identification

Dialect identification is key to improving a range of tasks, such as opinion mining, because a speaker's location can strongly influence their attitude towards a subject. In this work, we describe the systems developed by our team for the VarDial 2020 Romanian Dialect Identification task, which was designed to challenge participants to solve the dialect identification problem for an under-resourced language such as Romanian. More specifically, we introduce a series of Transformer-based neural architectures that combine a BERT model pre-trained exclusively on Romanian with several other techniques, such as adversarial training and character-level embeddings. By using a custom Romanian BERT model, we reached a macro-F1 score of 64.75% on the test dataset, which ranked us 5th out of 8 participating teams. Moreover, we improved the F1 scores reported by the authors of MOROCO by over 1.7%, obtaining a macro-F1 score of 96.23%, alongside micro and weighted F1 scores of 96.25%.
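Since the results above are reported as macro, micro, and weighted F1, the following minimal sketch shows how the three averages differ. It is an illustration only, not the evaluation script used in this work: the function name `f1_scores` and the toy labels `"RO"` and `"MD"` are assumptions introduced for the example.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (macro, micro, weighted) F1 for single-label classification.

    Hypothetical helper for illustration; assumes non-empty inputs of equal length.
    """
    labels = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)  # number of gold examples per class
    per_class = {}
    tp_total = fp_total = fn_total = 0
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        per_class[label] = 2 * tp / denom if denom else 0.0
        tp_total += tp
        fp_total += fp
        fn_total += fn
    # Macro: unweighted mean of per-class F1, so every class counts equally.
    macro = sum(per_class.values()) / len(labels)
    # Micro: F1 computed from counts pooled over all classes.
    micro = 2 * tp_total / (2 * tp_total + fp_total + fn_total)
    # Weighted: per-class F1 averaged with weights proportional to class support.
    weighted = sum(per_class[l] * support[l] for l in labels) / len(y_true)
    return macro, micro, weighted


# Toy, made-up labels (not taken from any dataset).
y_true = ["RO", "RO", "RO", "MD"]
y_pred = ["RO", "RO", "MD", "MD"]
macro, micro, weighted = f1_scores(y_true, y_pred)
print(f"macro={macro:.4f} micro={micro:.4f} weighted={weighted:.4f}")
```

On imbalanced data the three values can diverge, which is why macro-F1 is often preferred when minority classes matter as much as majority ones.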
