Pre-training Polish Transformer-based Language Models at Scale

Transformer-based language models are now widely used in Natural Language Processing (NLP). This is especially true for English, for which many pre-trained models built on transformer architectures have been published in recent years. These models have driven forward the state of the art for a variety of standard NLP tasks such as classification, regression, and sequence labeling, as well as text-to-text tasks such as machine translation, question answering, and summarization. The situation has been different for low-resource languages such as Polish, however. Although some transformer-based language models for Polish are available, none of them come close to the scale, in terms of corpus size and number of parameters, of the largest English-language models. In this study, we present two language models for Polish based on the popular BERT architecture. The larger model was trained on a dataset consisting of over 1 billion Polish sentences, or 135 GB of raw text. We describe our methodology for collecting the data, preparing the corpus, and pre-training the model. We then evaluate our models on thirteen Polish linguistic tasks and demonstrate improvements over previous approaches on eleven of them.
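
As a rough illustration of the kind of pre-training setup summarized above, the sketch below shows masked language model pre-training of a BERT/RoBERTa-style model on a Polish text corpus with the HuggingFace Transformers library. This is a minimal sketch under stated assumptions, not the authors' actual training code: the file paths (polish_corpus.txt, ./polish-tokenizer), model size, and hyperparameters are illustrative placeholders.

```python
# Minimal sketch (not the authors' code) of masked language model pre-training
# for a Polish BERT/RoBERTa-style model using HuggingFace Transformers.
# Paths, vocabulary, and hyperparameters below are illustrative assumptions.

from datasets import load_dataset
from transformers import (
    RobertaConfig,
    RobertaForMaskedLM,
    RobertaTokenizerFast,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Hypothetical plain-text corpus, one sentence or document per line.
raw = load_dataset("text", data_files={"train": "polish_corpus.txt"})

# Assumes a subword tokenizer has already been trained on the corpus
# and saved to ./polish-tokenizer (e.g. with the `tokenizers` library).
tokenizer = RobertaTokenizerFast.from_pretrained("./polish-tokenizer")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

dataset = raw["train"].map(tokenize, batched=True, remove_columns=["text"])

# A "base"-sized configuration; a larger model would use more layers and heads.
config = RobertaConfig(
    vocab_size=tokenizer.vocab_size,
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    max_position_embeddings=514,
)
model = RobertaForMaskedLM(config)

# Dynamic masking: 15% of tokens are masked on the fly at each epoch.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

args = TrainingArguments(
    output_dir="polish-bert-base",
    per_device_train_batch_size=16,
    gradient_accumulation_steps=64,  # large effective batch, typical at this scale
    learning_rate=4e-4,
    max_steps=100_000,
    warmup_steps=10_000,
    save_steps=10_000,
    fp16=True,
)

Trainer(
    model=model,
    args=args,
    train_dataset=dataset,
    data_collator=collator,
).train()
```

In practice, training at the scale described in the abstract would be distributed over many GPUs and use a much larger effective batch size and step count, but the masked language modeling objective shown here is the core of the procedure.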
