NormFormer: Improved Transformer Pretraining with Extra Normalization

During pretraining, the Pre-LayerNorm transformer suffers from a gradient magnitude mismatch: gradients at early layers are much larger than at later layers. This mismatch can be alleviated by our proposed NormFormer architecture, which adds three normalization operations to each layer: a Layer Norm after self-attention, head-wise scaling of self-attention outputs, and a Layer Norm after the first fully connected layer. The extra operations incur negligible compute cost (+0.4% parameter increase), but improve pretraining perplexity and downstream task performance for both causal and masked language models ranging from 125 million to 2.7 billion parameters. For example, adding NormFormer on top of our strongest 1.3B parameter baseline can reach equal perplexity 24% faster, or converge 0.27 perplexity better in the same compute budget. This model reaches GPT3-Large (1.3B) zero-shot performance 60% faster. For masked language modeling, NormFormer improves fine-tuned GLUE performance by 1.9% on average. Code to train NormFormer models is available in fairseq.
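To make the three added operations concrete, below is a minimal PyTorch sketch of one plausible placement inside a Pre-LN transformer layer. It is not the fairseq implementation: the names (`NormFormerLayer`, `HeadScaleSelfAttention`, `head_scale`, `mid_ln`) are illustrative, attention masking and dropout are omitted, and the exact position of the extra FFN Layer Norm relative to the GELU activation is an assumption of this sketch.

```python
import math
import torch
import torch.nn as nn


class HeadScaleSelfAttention(nn.Module):
    """Self-attention with a learned scalar per head, applied to each head's
    output before the output projection (the head-wise scaling described above)."""

    def __init__(self, embed_dim, num_heads):
        super().__init__()
        assert embed_dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        self.qkv = nn.Linear(embed_dim, 3 * embed_dim)
        self.out_proj = nn.Linear(embed_dim, embed_dim)
        # One learnable scale per head, initialized to 1 (a small parameter increase).
        self.head_scale = nn.Parameter(torch.ones(num_heads))

    def forward(self, x):
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Reshape to (batch, heads, time, head_dim).
        q, k, v = (z.view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
                   for z in (q, k, v))
        attn = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(self.head_dim), dim=-1)
        out = attn @ v                                  # (b, heads, t, head_dim)
        out = out * self.head_scale.view(1, -1, 1, 1)   # scale each head's output
        out = out.transpose(1, 2).reshape(b, t, d)
        return self.out_proj(out)


class NormFormerLayer(nn.Module):
    """Pre-LN transformer layer plus the three extra normalization operations:
    a post-attention Layer Norm, head-wise scaling, and a Layer Norm after the
    first fully connected layer of the FFN."""

    def __init__(self, embed_dim=768, num_heads=12, ffn_dim=3072):
        super().__init__()
        self.attn_ln = nn.LayerNorm(embed_dim)       # standard Pre-LN
        self.self_attn = HeadScaleSelfAttention(embed_dim, num_heads)
        self.post_attn_ln = nn.LayerNorm(embed_dim)  # extra: LN after self-attention
        self.ffn_ln = nn.LayerNorm(embed_dim)        # standard Pre-LN
        self.fc1 = nn.Linear(embed_dim, ffn_dim)
        self.act = nn.GELU()
        self.mid_ln = nn.LayerNorm(ffn_dim)          # extra: LN after the first FC
        self.fc2 = nn.Linear(ffn_dim, embed_dim)

    def forward(self, x):
        # Attention sub-block: the extra LN is applied before the residual add.
        x = x + self.post_attn_ln(self.self_attn(self.attn_ln(x)))
        # FFN sub-block: the extra LN sits between FC1+GELU and FC2
        # (placement relative to the activation is an assumption of this sketch).
        x = x + self.fc2(self.mid_ln(self.act(self.fc1(self.ffn_ln(x)))))
        return x
```

Removing the three extra modules (`post_attn_ln`, `head_scale`, `mid_ln`) recovers a plain Pre-LN layer, which makes the parameter overhead easy to verify directly.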
