On the Importance of Pre-training Data Volume for Compact Language Models

Recent advances in language modeling have led to computationally intensive, resource-demanding state-of-the-art models. In an effort towards sustainable practices, we study the impact of pre-training data volume on compact language models. Multiple BERT-based models are trained on gradually increasing amounts of French text. Through fine-tuning on the French Question Answering Dataset (FQuAD), we observe that well-performing models can be obtained with as little as 100 MB of text. In addition, we show that, beyond critically low amounts of pre-training data, an intermediate pre-training step on the task-specific corpus does not yield substantial improvements.
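
As a rough illustration of the setup described above, the sketch below pre-trains a compact BERT-style encoder from scratch on a small French text file with masked language modeling, using the HuggingFace Transformers library. The model dimensions, tokenizer path, corpus file, and hyperparameters are illustrative assumptions, not the paper's exact configuration.

# Minimal sketch: pre-train a compact BERT-style model from scratch on a
# small French corpus with masked language modeling (HuggingFace Transformers).
# Paths, model sizes, and hyperparameters are illustrative assumptions.
from datasets import load_dataset
from transformers import (
    BertConfig,
    BertForMaskedLM,
    BertTokenizerFast,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# A previously trained subword tokenizer is assumed to exist at this path.
tokenizer = BertTokenizerFast.from_pretrained("tokenizer_fr/")

# Compact configuration, far smaller than BERT-base, chosen for illustration.
config = BertConfig(
    vocab_size=tokenizer.vocab_size,
    hidden_size=256,
    num_hidden_layers=6,
    num_attention_heads=4,
    intermediate_size=1024,
)
model = BertForMaskedLM(config)

# Raw French text, one document per line (placeholder file name); in the
# paper's setting the size of this file is the quantity being varied.
raw = load_dataset("text", data_files={"train": "french_100mb.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])

# Dynamic masking of 15% of tokens, as in standard MLM pre-training.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="compact-bert-fr",
    per_device_train_batch_size=64,
    num_train_epochs=3,
    learning_rate=1e-4,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()

In the study's setting, one would repeat this pre-training with corpora of increasing size while keeping the architecture fixed, then fine-tune each resulting checkpoint on FQuAD to compare downstream question-answering performance.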
