Paper information - Exploring the Power of Romanian BERT for Dialect Identification

Exploring the Power of Romanian BERT for Dialect Identification

Dialect identification is a key step toward improving a range of tasks, such as opinion mining, because a speaker's location can strongly influence their attitude towards a subject. In this work, we describe the systems developed by our team for VarDial 2020: Romanian Dialect Identification, a task created to challenge participants to solve the dialect identification problem for an under-resourced language such as Romanian. More specifically, we introduce a series of Transformer-based neural architectures that combine a BERT model pre-trained exclusively on Romanian with several other techniques, such as adversarial training or character-level embeddings. Using a custom Romanian BERT model, we reached a macro-F1 score of 64.75 on the test dataset, which ranked us 5th out of 8 participating teams. Moreover, we improved on the F1 scores reported by the authors of MOROCO by over 1.7%, obtaining a 96.23% macro-F1 score, alongside micro and weighted F1 scores of 96.25%.
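The abstract reports macro, micro, and weighted F1 scores side by side. As a minimal sketch of how these three averages differ, the snippet below computes them for a single-label classifier in plain Python. The function name `f1_scores`, the labels `"RO"` and `"MD"`, and the toy predictions are assumptions made up for illustration; this is not the scoring script used by the shared task or by the authors.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (macro, micro, weighted) F1 for single-label classification."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    per_class = {c: f1(tp[c], fp[c], fn[c]) for c in labels}
    support = Counter(y_true)

    # Macro: unweighted mean of per-class F1, so every class counts equally.
    macro = sum(per_class.values()) / len(labels)
    # Micro: F1 computed from counts pooled over all classes.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Weighted: per-class F1 averaged by the number of true samples per class.
    weighted = sum(per_class[c] * support[c] for c in labels) / len(y_true)
    return macro, micro, weighted


# Hypothetical toy data: three samples of one class, one of the other.
y_true = ["RO", "RO", "RO", "MD"]
y_pred = ["RO", "RO", "MD", "MD"]
macro, micro, weighted = f1_scores(y_true, y_pred)
print(f"macro={macro:.4f} micro={micro:.4f} weighted={weighted:.4f}")
```

On imbalanced data the three values drift apart, because macro-F1 gives the minority class the same weight as the majority class. When the classes are roughly balanced and the classifier is accurate, they tend to be close, as with the 96.23% and 96.25% figures quoted above.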
