Curriculum Learning: A Regularization Method for Efficient and Stable Billion-Scale GPT Model Pre-Training

Recent works have demonstrated great success in training high-capacity autoregressive language models (GPT, GPT-2, GPT-3) on huge amounts of unlabeled text for text generation. Despite these strong results, autoregressive models face a growing training instability issue. Our study of GPT-2 models (117M and 1.5B parameters) shows that larger model sizes, sequence lengths, batch sizes, and learning rates lead to lower training stability and a higher risk of divergence. To avoid divergence and achieve better generalization performance, one has to train with smaller batch sizes and learning rates, which hurts training efficiency and lengthens training time. To overcome this stability-efficiency dilemma, we present a study of a curriculum learning-based approach that improves the pre-training convergence speed of autoregressive models. More importantly, we find that curriculum learning, acting as a regularization method, reduces gradient variance and enables training autoregressive models with much larger batch sizes and learning rates without training instability, further improving the training speed. Our evaluations demonstrate that curriculum learning enables training GPT-2 models with an 8x larger batch size and a 4x larger learning rate, whereas the baseline approach struggles with training divergence. To reach the same validation perplexity targets during pre-training, curriculum learning reduces the required number of tokens and wall-clock time by up to 61% and 49%, respectively. To reach the same or better zero-shot WikiText-103/LAMBADA evaluation results at the end of pre-training, curriculum learning reduces the required number of tokens and wall-clock time by up to 54% and 70%, respectively.
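The abstract does not spell out the curriculum metric or pacing function, but a common instantiation for autoregressive pre-training is to start from short (truncated) sequences and grow the sequence length over the early part of training. The sketch below is a minimal illustration under that assumption; the function names, default lengths, and linear pacing schedule are illustrative, not the paper's exact recipe.

```python
def curriculum_seqlen(step, start_seqlen=64, max_seqlen=1024,
                      total_curriculum_steps=10000):
    """Return the sequence length to use at a given training step.

    Difficulty (sequence length) grows linearly from `start_seqlen` to
    `max_seqlen` over `total_curriculum_steps`, then stays at `max_seqlen`.
    All names and default values are illustrative assumptions, not the
    paper's exact schedule.
    """
    if step >= total_curriculum_steps:
        return max_seqlen
    # Linear pacing function; round down to a multiple of 8 for hardware efficiency.
    seqlen = start_seqlen + (max_seqlen - start_seqlen) * step / total_curriculum_steps
    return max(start_seqlen, int(seqlen) // 8 * 8)


def apply_curriculum(batch_tokens, step):
    """Truncate a batch of token ids to the current curriculum length.

    `batch_tokens` is assumed to be a [batch, max_seqlen] array or tensor.
    """
    return batch_tokens[:, :curriculum_seqlen(step)]
```

In this sketch, the training loop simply calls `apply_curriculum(batch, step)` before the forward pass; once `step` passes `total_curriculum_steps`, training proceeds on full-length sequences as in the baseline.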
