On the Transformer Growth for Progressive BERT Training

As the excessive cost of pre-training raises the need for greater efficiency, considerable effort has been devoted to training BERT progressively: starting from an inferior but low-cost model and gradually increasing its computational complexity. Our objective is to help advance the understanding of such Transformer growth and to identify principles that guide progressive training. First, we find that, similar to network architecture selection, Transformer growth also favors compound scaling. Specifically, while existing methods grow the network along only a single dimension, we observe that it is beneficial to use compound growth operators that balance multiple dimensions (e.g., the depth, width, and input length of the model). Moreover, we explore alternative growth operators in each dimension via controlled comparisons to provide practical guidance for operator selection. In light of these analyses, the proposed method, CompoundGrow, speeds up BERT pre-training by 73.6% and 82.2% for the base and large models, respectively, while achieving comparable performance.
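
To make the idea of compound growth concrete, the following is a minimal Python sketch of a staged pre-training loop that enlarges a model along depth, width, and input length at once between stages. It is an illustration under stated assumptions, not the paper's implementation: the `ModelConfig` fields, the sizes in `GROWTH_SCHEDULE`, and the helpers `grow`, `build_model`, and `train_for` are all hypothetical placeholders.

```python
# Hypothetical sketch of a compound growth schedule (not the authors' code).
# A small model is trained first, then grown along several dimensions
# simultaneously before training continues at the larger size.

from dataclasses import dataclass


@dataclass
class ModelConfig:
    num_layers: int      # depth
    hidden_size: int     # width
    max_seq_length: int  # input length


# Each stage grows the model in multiple dimensions at once (compound
# growth), rather than along a single dimension only. Sizes are illustrative.
GROWTH_SCHEDULE = [
    ModelConfig(num_layers=6,  hidden_size=384, max_seq_length=128),
    ModelConfig(num_layers=12, hidden_size=576, max_seq_length=256),
    ModelConfig(num_layers=24, hidden_size=768, max_seq_length=512),
]


def grow(model, old_cfg: ModelConfig, new_cfg: ModelConfig):
    """Placeholder for growth operators: expand the trained weights of the
    smaller model into the larger configuration (e.g., layer stacking for
    depth, parameter expansion for width)."""
    raise NotImplementedError


def progressive_pretraining(build_model, train_for, steps_per_stage: int):
    """Train at the smallest configuration, then alternate growth steps and
    further training until the full-size model is reached."""
    model = build_model(GROWTH_SCHEDULE[0])
    for old_cfg, new_cfg in zip(GROWTH_SCHEDULE, GROWTH_SCHEDULE[1:]):
        train_for(model, steps_per_stage)      # cheap training on the small model
        model = grow(model, old_cfg, new_cfg)  # compound growth step
    train_for(model, steps_per_stage)          # finish training at full size
    return model
```

The point of the sketch is the schedule structure: most training steps are spent on cheaper intermediate configurations, and each growth step changes several dimensions together rather than one at a time.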
