What the [MASK]? Making Sense of Language-Specific BERT Models

Recently, Natural Language Processing (NLP) has witnessed impressive progress in many areas, driven by the advent of novel, pretrained contextual representation models. In particular, Devlin et al. (2019) proposed BERT (Bidirectional Encoder Representations from Transformers), a model that enables researchers to obtain state-of-the-art performance on numerous NLP tasks by fine-tuning its representations on their own dataset and task, without the need to develop and train highly task-specific architectures. The authors also released multilingual BERT (mBERT), a model trained on a corpus of 104 languages that can serve as a universal language model and that obtained impressive results on a zero-shot cross-lingual natural language inference task. Driven by the potential of BERT models, the NLP community has started to produce a large number of BERT models trained on a single language and tested on a specific data domain and task. This allows us to evaluate the true potential of mBERT as a universal language model by comparing it against these more specific models. This paper presents the current state of the art in language-specific BERT models, providing an overall picture along several dimensions (i.e. architectures, data domains, and tasks). Our aim is to provide an immediate and straightforward overview of the commonalities and differences between language-specific BERT models and mBERT. We also provide an interactive and constantly updated website that can be used to explore the information we have collected, at this https URL.
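To make the comparison concrete, the following is a minimal sketch (not part of the paper) of how one might query mBERT and a language-specific model such as CamemBERT [28] on the masked-token prediction task that gives the paper its title. It assumes the Hugging Face transformers library and the publicly released checkpoints bert-base-multilingual-cased and camembert-base.

    # Minimal sketch, assuming the Hugging Face `transformers` library and the
    # publicly released mBERT and CamemBERT checkpoints (not from the paper).
    from transformers import pipeline

    # Multilingual BERT: a single model pretrained on 104 languages.
    mbert = pipeline("fill-mask", model="bert-base-multilingual-cased")

    # Language-specific model: CamemBERT, pretrained on French only.
    camembert = pipeline("fill-mask", model="camembert-base")

    # Each model predicts the token hidden behind its own mask symbol
    # ([MASK] for BERT-style vocabularies, <mask> for RoBERTa-style ones).
    print(mbert("Paris est la [MASK] de la France.")[0]["token_str"])
    print(camembert("Paris est la <mask> de la France.")[0]["token_str"])

Comparing the predictions (and, after fine-tuning, the downstream scores) of such pairs of models is the kind of evaluation the survey organizes across languages, data domains, and tasks.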

[1] Roberto Navigli, et al. SemEval-2013 Task 12: Multilingual Word Sense Disambiguation, 2013, SemEval.

[2] Jörg Tiedemann, et al. Parallel Data, Tools and Interfaces in OPUS, 2012, LREC.

[3] David Ifeoluwa Adelani, et al. Massive vs. Curated Embeddings for Low-Resourced Languages: the Case of Yorùbá and Twi, 2019, LREC.

[4] Jason Baldridge, et al. PAWS-X: A Cross-lingual Adversarial Dataset for Paraphrase Identification, 2019, EMNLP.

[5] Nizar Habash, et al. CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, 2017, CoNLL.

[6] Alessandro Mazzei, et al. Overview of the EVALITA 2016 Part of Speech on Twitter for Italian Task, 2016.

[7] Mikhail Arkhipov, et al. Adaptation of Deep Bidirectional Multilingual Transformers for Russian Language, 2019, ArXiv.

[8] Alexey Sorokin, et al. Tuning Multilingual Transformers for Language-Specific Named Entity Recognition, 2019, BSNLP@ACL.

[9] Benoît Sagot, et al. Asynchronous Pipeline for Processing Huge Corpora on Medium to Low Resource Infrastructures, 2019.

[10] Anna Rumshisky, et al. A Primer in BERTology: What We Know About How BERT Works, 2020, Transactions of the Association for Computational Linguistics.

[11] Ming-Wei Chang, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.

[12] Pieter Delobelle, et al. RobBERT: a Dutch RoBERTa-based Language Model, 2020, EMNLP.

[13] Marie-Francine Moens, et al. Binary and Multitask Classification Model for Dutch Anaphora Resolution: Die/Dat Prediction, 2020, ArXiv.

[14] Lukasz Kaiser, et al. Attention is All you Need, 2017, NIPS.

[15] Taku Kudo, et al. SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing, 2018, EMNLP.

[16] Omer Levy, et al. RoBERTa: A Robustly Optimized BERT Pretraining Approach, 2019, ArXiv.

[17] Tapio Salakoski, et al. Multilingual is not enough: BERT for Finnish, 2019, ArXiv.

[18] Guillaume Lample, et al. XNLI: Evaluating Cross-lingual Sentence Representations, 2018, EMNLP.

[19] Cristina Bosco, et al. Overview of the EVALITA 2016 Part Of Speech on TWitter for ITAlian Task, 2016, CLiC-it/EVALITA.

[20] Benjamin Lecouteux, et al. FlauBERT: Unsupervised Language Model Pre-training for French, 2020, LREC.

[21] Eva Schlinger, et al. How Multilingual is Multilingual BERT?, 2019, ACL.

[22] Hazem M. Hajj, et al. AraBERT: Transformer-based Model for Arabic Language Understanding, 2020, OSACT.

[23] Andrew McCallum, et al. Energy and Policy Considerations for Deep Learning in NLP, 2019, ACL.

[24] Jeffrey Dean, et al. Distributed Representations of Words and Phrases and their Compositionality, 2013, NIPS.

[25] Cristina Bosco, et al. ParTUT: The Turin University Parallel Treebank, 2015, Italian Natural Language Processing within the PARLI Project.

[26] Tommaso Caselli, et al. BERTje: A Dutch BERT Model, 2019, ArXiv.

[27] Giovanni Semeraro, et al. AlBERTo: Italian BERT Language Understanding Model for NLP Challenging Tasks Based on Tweets, 2019, CLiC-it.

[28] Laurent Romary, et al. CamemBERT: a Tasty French Language Model, 2019, ACL.

[29] Kevin Gimpel, et al. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations, 2019, ICLR.

[30] Wanxiang Che, et al. Pre-Training with Whole Word Masking for Chinese BERT, 2019, ArXiv.

[31] Maximilian Wendt, et al. HDT-UD: A very large Universal Dependencies Treebank for German, 2019, Proceedings of the Third Workshop on Universal Dependencies (UDW, SyntaxFest 2019).

[32] George Kurian, et al. Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation, 2016, ArXiv.

[33] Christian Biemann, et al. GermEval 2014 Named Entity Recognition Shared Task, 2014.