Pre-training Polish Transformer-based Language Models at Scale

Transformer-based language models are now widely used in Natural Language Processing (NLP). This is especially true for English, for which many pre-trained models built on transformer architectures have been published in recent years. These models have driven forward the state of the art for a variety of standard NLP tasks such as classification, regression, and sequence labeling, as well as text-to-text tasks such as machine translation, question answering, and summarization. The situation has been different for low-resource languages such as Polish, however. Although some transformer-based language models for Polish are available, none of them come close to the scale, in terms of corpus size and number of parameters, of the largest English-language models. In this study, we present two language models for Polish based on the popular BERT architecture. The larger model was trained on a dataset consisting of over 1 billion Polish sentences, or 135 GB of raw text. We describe our methodology for collecting the data, preparing the corpus, and pre-training the model. We then evaluate our models on thirteen Polish linguistic tasks and demonstrate improvements over previous approaches on eleven of them.
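
As a rough illustration of the kind of pre-training setup summarized above, the sketch below shows masked language model pre-training of a BERT/RoBERTa-style model on a Polish text corpus with the HuggingFace Transformers library. This is a minimal sketch under stated assumptions, not the authors' actual training code: the file paths (polish_corpus.txt, ./polish-tokenizer), model size, and hyperparameters are illustrative placeholders.

```python
# Minimal sketch (not the authors' code) of masked language model pre-training
# for a Polish BERT/RoBERTa-style model using HuggingFace Transformers.
# Paths, vocabulary, and hyperparameters below are illustrative assumptions.

from datasets import load_dataset
from transformers import (
    RobertaConfig,
    RobertaForMaskedLM,
    RobertaTokenizerFast,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Hypothetical plain-text corpus, one sentence or document per line.
raw = load_dataset("text", data_files={"train": "polish_corpus.txt"})

# Assumes a subword tokenizer has already been trained on the corpus
# and saved to ./polish-tokenizer (e.g. with the `tokenizers` library).
tokenizer = RobertaTokenizerFast.from_pretrained("./polish-tokenizer")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

dataset = raw["train"].map(tokenize, batched=True, remove_columns=["text"])

# A "base"-sized configuration; a larger model would use more layers and heads.
config = RobertaConfig(
    vocab_size=tokenizer.vocab_size,
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    max_position_embeddings=514,
)
model = RobertaForMaskedLM(config)

# Dynamic masking: 15% of tokens are masked on the fly at each epoch.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

args = TrainingArguments(
    output_dir="polish-bert-base",
    per_device_train_batch_size=16,
    gradient_accumulation_steps=64,  # large effective batch, typical at this scale
    learning_rate=4e-4,
    max_steps=100_000,
    warmup_steps=10_000,
    save_steps=10_000,
    fp16=True,
)

Trainer(
    model=model,
    args=args,
    train_dataset=dataset,
    data_collator=collator,
).train()
```

In practice, training at the scale described in the abstract would be distributed over many GPUs and use a much larger effective batch size and step count, but the masked language modeling objective shown here is the core of the procedure.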
