BERTimbau: Pretrained BERT Models for Brazilian Portuguese

Recent advances in language representation using neural networks have made it viable to transfer the learned internal states of large pretrained language models (LMs) to downstream natural language processing (NLP) tasks. This transfer learning approach improves the overall performance on many tasks and is highly beneficial when labeled data is scarce, making pretrained LMs valuable resources especially for languages with few annotated training examples. In this work, we train BERT (Bidirectional Encoder Representations from Transformers) models for Brazilian Portuguese, which we nickname BERTimbau. We evaluate our models on three downstream NLP tasks: sentence textual similarity, recognizing textual entailment, and named entity recognition. Our models improve the state of the art in all of these tasks, outperforming Multilingual BERT and confirming the effectiveness of large pretrained LMs for Portuguese. We release our models to the community, hoping to provide strong baselines for future NLP research: https://github.com/neuralmind-ai/portuguese-bert.
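
As a usage note, the released checkpoints can be loaded with standard tooling. The sketch below assumes the Hugging Face Transformers library and the model identifier "neuralmind/bert-base-portuguese-cased"; that identifier is an assumption on our part rather than something stated above, so consult the linked repository for the exact names of the released models.

    # Minimal sketch of extracting BERTimbau contextual embeddings with the
    # Hugging Face Transformers library. The model identifier below is an
    # assumption; see the repository linked above for the released checkpoints.
    from transformers import AutoModel, AutoTokenizer

    model_name = "neuralmind/bert-base-portuguese-cased"  # assumed identifier
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)

    # Encode a Portuguese sentence and run it through the encoder.
    inputs = tokenizer("Tinha uma pedra no meio do caminho.", return_tensors="pt")
    outputs = model(**inputs)

    # One contextual vector per subword token; a task-specific head (e.g. a
    # regression/classification layer for similarity and entailment, or a
    # token classifier for NER) is trained on top of these during fine-tuning.
    print(outputs.last_hidden_state.shape)  # (1, num_tokens, hidden_size)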
