Blockwise Parallel Transformer for Long Context Large Models

Transformers have emerged as the cornerstone of state-of-the-art natural language processing models, showcasing exceptional performance across a wide range of AI applications. However, the memory demands of the self-attention mechanism and the large feedforward network limit their ability to handle long sequences, creating challenges for tasks involving multiple long sequences or long-term dependencies. We present Blockwise Parallel Transformer (BPT), an approach that leverages blockwise computation of self-attention and fusion of the feedforward network to minimize memory costs. By processing longer input sequences while maintaining memory efficiency, BPT enables training on sequences up to 32 times longer than vanilla Transformers and 2 to 4 times longer than previous memory-efficient methods. Extensive experiments on language modeling and reinforcement learning tasks demonstrate the effectiveness of BPT in reducing memory requirements and improving performance.
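To make the idea concrete, below is a minimal JAX sketch of the blockwise pattern the abstract describes: queries are split into blocks, each query block accumulates attention over key/value blocks with an online softmax, and the feedforward network is applied per query block inside the same pass, so neither the full attention matrix nor the full-sequence FFN activation is ever materialized. This is an illustrative simplification under assumed shapes and names (`block_size`, `feedforward`, a ReLU FFN, no layer norm or masking), not the paper's reference implementation.

```python
import jax
import jax.numpy as jnp

def feedforward(x, w1, w2):
    # Position-wise FFN applied to a single query block (ReLU, no bias for brevity).
    return jnp.maximum(x @ w1, 0.0) @ w2

def blockwise_parallel_block(q, k, v, w1, w2, block_size):
    """q, k, v: [seq_len, d]; w1: [d, d_ff]; w2: [d_ff, d]. seq_len must be divisible by block_size."""
    seq_len, d = q.shape
    num_blocks = seq_len // block_size
    q_blocks = q.reshape(num_blocks, block_size, d)
    k_blocks = k.reshape(num_blocks, block_size, d)
    v_blocks = v.reshape(num_blocks, block_size, d)
    scale = 1.0 / jnp.sqrt(d)

    def per_query_block(q_blk):
        # Online-softmax accumulation over key/value blocks: only one
        # block_size x block_size score tile exists at any time.
        def scan_kv(carry, kv):
            acc, row_sum, row_max = carry
            k_blk, v_blk = kv
            scores = (q_blk @ k_blk.T) * scale            # [bs, bs]
            new_max = jnp.maximum(row_max, scores.max(-1))
            correction = jnp.exp(row_max - new_max)       # rescale previous accumulator
            p = jnp.exp(scores - new_max[:, None])
            acc = acc * correction[:, None] + p @ v_blk
            row_sum = row_sum * correction + p.sum(-1)
            return (acc, row_sum, new_max), None

        init = (jnp.zeros_like(q_blk),
                jnp.zeros(block_size),
                jnp.full(block_size, -jnp.inf))
        (acc, row_sum, _), _ = jax.lax.scan(scan_kv, init, (k_blocks, v_blocks))
        attn_out = acc / row_sum[:, None]
        # Fused feedforward: applied to this query block immediately, so the
        # FFN activation for the whole sequence is never materialized.
        return attn_out + feedforward(attn_out, w1, w2)

    out_blocks = jax.vmap(per_query_block)(q_blocks)      # parallel over query blocks
    return out_blocks.reshape(seq_len, d)
```

The memory saving comes from the scan over key/value blocks (peak attention memory scales with `block_size` rather than sequence length) combined with applying the FFN per block; a full transformer layer would additionally include residual connections, layer normalization, and causal masking.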
