bert2BERT: Towards Reusable Pretrained Language Models

In recent years, researchers have tended to pre-train ever-larger language models to explore the upper limit of deep models. However, pre-training a large language model requires intensive computational resources, and most models are trained from scratch without reusing existing pre-trained models, which is wasteful. In this paper, we propose bert2BERT, which effectively transfers the knowledge of an existing smaller pre-trained model (e.g., BERT_BASE) to a large model (e.g., BERT_LARGE) through parameter initialization and significantly improves the pre-training efficiency of the large model. Specifically, we extend the previous function-preserving method (Chen et al., 2016) to Transformer-based language models and further improve it by proposing an advanced knowledge initialization for the large model. In addition, we propose a two-stage pre-training method to further accelerate the training process. We conduct extensive experiments on representative PLMs (e.g., BERT and GPT) and demonstrate that (1) our method saves a significant amount of training cost compared with baselines, including learning from scratch, StackBERT (Gong et al., 2019), and MSLT (Yang et al., 2020); and (2) our method is generic and applicable to different types of pre-trained models. In particular, bert2BERT saves about 45% and 47% of the computational cost of pre-training BERT_BASE and GPT_BASE, respectively, by reusing models of roughly half their sizes. The source code will be publicly available upon publication.
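
As context for the function-preserving expansion the abstract builds on, below is a minimal sketch of Net2Net-style width growth (Chen et al., 2016) for a pair of linear layers: new units copy the parameters of existing units, and outgoing weights are rescaled so the composed function is unchanged. The helper name `widen_fpi`, the shapes, and the plain-linear setting are illustrative assumptions, not the paper's implementation; bert2BERT's actual initialization additionally handles Transformer-specific modules (multi-head attention, feed-forward blocks, layer normalization) and the proposed advanced knowledge initialization, which are not shown here.

```python
import numpy as np

def widen_fpi(W_in, W_out, new_width, rng=None):
    """Net2Net-style function-preserving width expansion (illustrative sketch).

    W_in  : (d_in, n)  weights producing the hidden layer being widened
    W_out : (n, d_out) weights consuming that hidden layer
    Returns (d_in, m) and (m, d_out) matrices whose composition equals
    the original W_in @ W_out.
    """
    rng = np.random.default_rng() if rng is None else rng
    n, m = W_in.shape[1], new_width
    assert m >= n, "can only grow the layer"

    # g maps each new unit to a source unit: the first n units map to
    # themselves, the extra units copy randomly chosen existing units.
    g = np.concatenate([np.arange(n), rng.integers(0, n, size=m - n)])

    # Incoming weights: duplicate the columns of the chosen source units.
    W_in_new = W_in[:, g]

    # Outgoing weights: duplicate rows, then divide by how many times each
    # source unit was copied so the summed contribution stays the same.
    counts = np.bincount(g, minlength=n)
    W_out_new = W_out[g, :] / counts[g][:, None]
    return W_in_new, W_out_new

# Sanity check: the widened network computes the same function.
rng = np.random.default_rng(0)
x, W1, W2 = rng.normal(size=(4, 8)), rng.normal(size=(8, 16)), rng.normal(size=(16, 32))
W1_big, W2_big = widen_fpi(W1, W2, new_width=24, rng=rng)
assert np.allclose(x @ W1 @ W2, x @ W1_big @ W2_big)
```

The final assertion checks the function-preserving property: the widened pair produces exactly the same outputs as the original pair, so training of the larger model can resume from the smaller model's behavior rather than from scratch.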

[1] Minjia Zhang, et al. Accelerating Training of Transformer-Based Language Models with Progressive Layer Dropping, 2020, NeurIPS.

[2] Di He, et al. Efficient Training of BERT by Progressively Stacking, 2019, ICML.

[3] Oren Etzioni, et al. Green AI, 2019, Commun. ACM.

[4] James Demmel, et al. Large Batch Optimization for Deep Learning: Training BERT in 76 minutes, 2019, ICLR.

[5] Yiming Yang, et al. XLNet: Generalized Autoregressive Pretraining for Language Understanding, 2019, NeurIPS.

[6] Benoît Sagot, et al. What Does BERT Learn about the Structure of Language?, 2019, ACL.

[7] Chen Chen, et al. On the Transformer Growth for Progressive BERT Training, 2020, NAACL.

[8] Omer Levy, et al. What Does BERT Look at? An Analysis of BERT's Attention, 2019, BlackboxNLP@ACL.

[9] Tianqi Chen, et al. Net2Net: Accelerating Learning via Knowledge Transfer, 2015, ICLR.

[10] Liwei Wang, et al. On Layer Normalization in the Transformer Architecture, 2020, ICML.

[11] Ilya Sutskever, et al. Language Models are Unsupervised Multitask Learners, 2019.

[12] Sanja Fidler, et al. Aligning Books and Movies: Towards Story-Like Visual Explanations by Watching Movies and Reading Books, 2015, ICCV.

[13] Omer Levy, et al. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding, 2018, BlackboxNLP@EMNLP.

[14] Omer Levy, et al. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension, 2019, ACL.

[15] Alec Radford, et al. Improving Language Understanding by Generative Pre-Training, 2018.

[16] Geoffrey E. Hinton, et al. Layer Normalization, 2016, ArXiv.

[17] Qiang Liu, et al. Energy-Aware Neural Architecture Optimization with Fast Splitting Steepest Descent, 2019, ArXiv.

[18] Kevin Gimpel, et al. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations, 2019, ICLR.

[19] Ming-Wei Chang, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.

[20] Qiang Yang, et al. A Survey on Transfer Learning, 2010, IEEE Transactions on Knowledge and Data Engineering.

[21] Mohammad Shoeybi, et al. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism, 2019, ArXiv.

[22] Kevin Gimpel, et al. Gaussian Error Linear Units (GELUs), 2016.

[23] Chen Xing, et al. Taking Notes on the Fly Helps Language Pre-Training, 2021, ICLR.

[24] Priyadarshini Panda, et al. Energy-efficient and Robust Cumulative Training with Net2Net Transformation, 2020, IJCNN.

[25] Song Han, et al. Learning both Weights and Connections for Efficient Neural Network, 2015, NIPS.

[26] Omer Levy, et al. RoBERTa: A Robustly Optimized BERT Pretraining Approach, 2019, ArXiv.

[27] Samuel R. Bowman, et al. Neural Network Acceptability Judgments, 2018, Transactions of the Association for Computational Linguistics.

[28] Jingqiao Zhang, et al. Progressively Stacking 2.0: A Multi-stage Layerwise Training Method for BERT Training Speedup, 2020, ArXiv.

[29] Noam Shazeer, et al. Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity, 2021, ArXiv.

[30] Qiang Liu, et al. Splitting Steepest Descent for Growing Neural Architectures, 2019, NeurIPS.

[31] Kaisheng M. Wang, et al. PanGu-α: Large-scale Autoregressive Pretrained Chinese Language Models with Auto-parallel Computation, 2021, ArXiv.

[32] Bo Liu, et al. Firefly Neural Architecture Descent: a General Approach for Growing Neural Networks, 2021, NeurIPS.

[33] Quoc V. Le, et al. ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators, 2020, ICLR.

[34] Mark Chen, et al. Language Models are Few-Shot Learners, 2020, NeurIPS.

[35] Lukasz Kaiser, et al. Attention is All you Need, 2017, NIPS.

[36] Qiang Liu, et al. Steepest Descent Neural Architecture Optimization: Escaping Local Optimum with Signed Neural Splitting, 2020, ArXiv.

[37] Zhiyuan Liu, et al. Knowledge Inheritance for Pre-trained Language Models, 2021, ArXiv.

[38] Jian Zhang, et al. SQuAD: 100,000+ Questions for Machine Comprehension of Text, 2016, EMNLP.