Sparse is Enough in Scaling Transformers

Large Transformer models yield impressive results on many tasks, but are expensive to train, or even fine-tune, and so slow at decoding that their use and study become out of reach. We address this problem by leveraging sparsity. We study sparse variants for all layers in the Transformer and propose Scaling Transformers, a family of next-generation Transformer models that use sparse layers to scale efficiently and perform unbatched decoding much faster than the standard Transformer as we scale up the model size. Surprisingly, the sparse layers are enough to obtain the same perplexity as a standard Transformer with the same number of parameters. We also integrate prior sparsity approaches to attention, enabling fast inference on long sequences even with limited memory. This results in performance competitive with the state of the art on long-text summarization.
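The core sparse feed-forward idea can be illustrated in a few lines: split the feed-forward dimension into blocks, let a small controller pick one active unit per block, and compute only the selected columns and rows of the weight matrices, so unbatched decoding touches a fraction of the parameters per token. The sketch below is a minimal, inference-time illustration in NumPy; the function name, argument layout, and the single full-rank controller matrix are simplifications for exposition (training would use a differentiable selection such as Gumbel-softmax, and a low-rank controller), not the authors' implementation.

```python
import numpy as np

def sparse_ffn_inference(x, W1, b1, W2, b2, controller_W, block_size):
    """Inference-time sketch of a sparse feed-forward layer.

    The d_ff dimension is split into blocks of `block_size`; a small
    controller picks one active unit per block, so only d_ff / block_size
    columns of W1 (and rows of W2) are used per token.

    x:            (d_model,)        single-token activation (unbatched decoding)
    W1:           (d_model, d_ff)   first dense weight
    W2:           (d_ff, d_model)   second dense weight
    controller_W: (d_model, d_ff)   scores used to choose the active units
    """
    d_model, d_ff = W1.shape
    n_blocks = d_ff // block_size

    # Controller scores per block; at inference we take a hard argmax
    # (training would replace this with a Gumbel-softmax sample).
    scores = (x @ controller_W).reshape(n_blocks, block_size)
    active = scores.argmax(axis=-1)                    # (n_blocks,)
    idx = active + np.arange(n_blocks) * block_size    # flat column indices

    # Only the selected columns of W1 and rows of W2 enter the matmuls.
    h = np.maximum(x @ W1[:, idx] + b1[idx], 0.0)      # ReLU, (n_blocks,)
    y = h @ W2[idx, :] + b2                            # (d_model,)
    return y
```

With a block size of 64, for example, each decoded token multiplies against only 1/64 of the feed-forward parameters, which is the source of the decoding speedup described above.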
