bert2BERT: Towards Reusable Pretrained Language Models

In recent years, researchers have tended to pre-train ever-larger language models to explore the upper limit of deep models. However, pre-training a large language model requires intensive computational resources, and most models are trained from scratch without reusing existing pre-trained models, which is wasteful. In this paper, we propose bert2BERT, which effectively transfers the knowledge of an existing smaller pre-trained model (e.g., BERT_BASE) to a large model (e.g., BERT_LARGE) through parameter initialization and significantly improves the pre-training efficiency of the large model. Specifically, we extend the previous function-preserving method (Chen et al., 2016) to Transformer-based language models and further improve it by proposing an advanced knowledge initialization for the large model. In addition, we propose a two-stage pre-training method to further accelerate the training process. We conduct extensive experiments on representative PLMs (e.g., BERT and GPT) and demonstrate that (1) our method saves a significant amount of training cost compared with baselines, including learning from scratch, StackBERT (Gong et al., 2019), and MSLT (Yang et al., 2020); and (2) our method is generic and applicable to different types of pre-trained models. In particular, bert2BERT saves about 45% and 47% of the computational cost of pre-training BERT_BASE and GPT_BASE, respectively, by reusing models of roughly half their sizes. The source code will be publicly available upon publication.
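
As context for the function-preserving expansion the abstract builds on, below is a minimal sketch of Net2Net-style width growth (Chen et al., 2016) for a pair of linear layers: new units copy the parameters of existing units, and outgoing weights are rescaled so the composed function is unchanged. The helper name `widen_fpi`, the shapes, and the plain-linear setting are illustrative assumptions, not the paper's implementation; bert2BERT's actual initialization additionally handles Transformer-specific modules (multi-head attention, feed-forward blocks, layer normalization) and the proposed advanced knowledge initialization, which are not shown here.

```python
import numpy as np

def widen_fpi(W_in, W_out, new_width, rng=None):
    """Net2Net-style function-preserving width expansion (illustrative sketch).

    W_in  : (d_in, n)  weights producing the hidden layer being widened
    W_out : (n, d_out) weights consuming that hidden layer
    Returns (d_in, m) and (m, d_out) matrices whose composition equals
    the original W_in @ W_out.
    """
    rng = np.random.default_rng() if rng is None else rng
    n, m = W_in.shape[1], new_width
    assert m >= n, "can only grow the layer"

    # g maps each new unit to a source unit: the first n units map to
    # themselves, the extra units copy randomly chosen existing units.
    g = np.concatenate([np.arange(n), rng.integers(0, n, size=m - n)])

    # Incoming weights: duplicate the columns of the chosen source units.
    W_in_new = W_in[:, g]

    # Outgoing weights: duplicate rows, then divide by how many times each
    # source unit was copied so the summed contribution stays the same.
    counts = np.bincount(g, minlength=n)
    W_out_new = W_out[g, :] / counts[g][:, None]
    return W_in_new, W_out_new

# Sanity check: the widened network computes the same function.
rng = np.random.default_rng(0)
x, W1, W2 = rng.normal(size=(4, 8)), rng.normal(size=(8, 16)), rng.normal(size=(16, 32))
W1_big, W2_big = widen_fpi(W1, W2, new_width=24, rng=rng)
assert np.allclose(x @ W1 @ W2, x @ W1_big @ W2_big)
```

The final assertion checks the function-preserving property: the widened pair produces exactly the same outputs as the original pair, so training of the larger model can resume from the smaller model's behavior rather than from scratch.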

[1] Minjia Zhang, et al. Accelerating Training of Transformer-Based Language Models with Progressive Layer Dropping, 2020, NeurIPS.

[2] Di He, et al. Efficient Training of BERT by Progressively Stacking, 2019, ICML.

[3] Oren Etzioni, et al. Green AI, 2019, Commun. ACM.

[4] James Demmel, et al. Large Batch Optimization for Deep Learning: Training BERT in 76 minutes, 2019, ICLR.

[5] Yiming Yang, et al. XLNet: Generalized Autoregressive Pretraining for Language Understanding, 2019, NeurIPS.

[6] Benoît Sagot, et al. What Does BERT Learn about the Structure of Language?, 2019, ACL.

[7] Chen Chen, et al. On the Transformer Growth for Progressive BERT Training, 2020, NAACL.

[8] Omer Levy, et al. What Does BERT Look at? An Analysis of BERT's Attention, 2019, BlackboxNLP@ACL.

[9] Tianqi Chen, et al. Net2Net: Accelerating Learning via Knowledge Transfer, 2015, ICLR.

[10] Liwei Wang, et al. On Layer Normalization in the Transformer Architecture, 2020, ICML.

[11] Ilya Sutskever, et al. Language Models are Unsupervised Multitask Learners, 2019.

[12] Sanja Fidler, et al. Aligning Books and Movies: Towards Story-Like Visual Explanations by Watching Movies and Reading Books, 2015, ICCV.

[13] Omer Levy, et al. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding, 2018, BlackboxNLP@EMNLP.

[14] Omer Levy, et al. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension, 2019, ACL.

[15] Alec Radford, et al. Improving Language Understanding by Generative Pre-Training, 2018.

[16] Geoffrey E. Hinton, et al. Layer Normalization, 2016, ArXiv.

[17] Qiang Liu, et al. Energy-Aware Neural Architecture Optimization with Fast Splitting Steepest Descent, 2019, ArXiv.

[18] Kevin Gimpel, et al. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations, 2019, ICLR.

[19] Ming-Wei Chang, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.

[20] Qiang Yang, et al. A Survey on Transfer Learning, 2010, IEEE Transactions on Knowledge and Data Engineering.

[21] Mohammad Shoeybi, et al. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism, 2019, ArXiv.

[22] Kevin Gimpel, et al. Gaussian Error Linear Units (GELUs), 2016.

[23] Chen Xing, et al. Taking Notes on the Fly Helps Language Pre-Training, 2021, ICLR.

[24] Priyadarshini Panda, et al. Energy-efficient and Robust Cumulative Training with Net2Net Transformation, 2020, IJCNN.

[25] Song Han, et al. Learning both Weights and Connections for Efficient Neural Network, 2015, NIPS.

[26] Omer Levy, et al. RoBERTa: A Robustly Optimized BERT Pretraining Approach, 2019, ArXiv.

[27] Samuel R. Bowman, et al. Neural Network Acceptability Judgments, 2018, Transactions of the Association for Computational Linguistics.

[28] Jingqiao Zhang, et al. Progressively Stacking 2.0: A Multi-stage Layerwise Training Method for BERT Training Speedup, 2020, ArXiv.

[29] Noam Shazeer, et al. Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity, 2021, ArXiv.

[30] Qiang Liu, et al. Splitting Steepest Descent for Growing Neural Architectures, 2019, NeurIPS.

[31] Kaisheng M. Wang, et al. PanGu-α: Large-scale Autoregressive Pretrained Chinese Language Models with Auto-parallel Computation, 2021, ArXiv.

[32] Bo Liu, et al. Firefly Neural Architecture Descent: a General Approach for Growing Neural Networks, 2021, NeurIPS.

[33] Quoc V. Le, et al. ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators, 2020, ICLR.

[34] Mark Chen, et al. Language Models are Few-Shot Learners, 2020, NeurIPS.

[35] Lukasz Kaiser, et al. Attention is All you Need, 2017, NIPS.

[36] Qiang Liu, et al. Steepest Descent Neural Architecture Optimization: Escaping Local Optimum with Signed Neural Splitting, 2020, ArXiv.

[37] Zhiyuan Liu, et al. Knowledge Inheritance for Pre-trained Language Models, 2021, ArXiv.

[38] Jian Zhang, et al. SQuAD: 100,000+ Questions for Machine Comprehension of Text, 2016, EMNLP.