Data Movement Is All You Need: A Case Study on Optimizing Transformers

Transformers have become widely used for language modeling and sequence learning tasks, and are one of the most important machine learning workloads today. Training a transformer is very compute-intensive, often taking days or weeks, and significant attention has been given to optimizing transformers. Despite this, existing implementations do not efficiently utilize GPUs. We find that data movement is the key bottleneck when training. Due to Amdahl's Law and massive improvements in compute performance, training has now become memory-bound. Further, existing frameworks use suboptimal data layouts. Using these insights, we present a recipe for globally optimizing data movement in transformers. We reduce data movement by up to 22.91% and overall achieve a 1.30x performance improvement over state-of-the-art frameworks when training BERT. Our approach is applicable more broadly to optimizing deep neural networks, and offers insight into how to tackle emerging performance bottlenecks.
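To make the memory-bound claim concrete, the sketch below gives a rough roofline-style estimate comparing the arithmetic intensity of a tensor contraction against an element-wise operator in a BERT-style encoder layer. It follows the spirit of the paper's analysis but is not its toolchain or measurements; the layer dimensions, per-element flop count, and V100 peak figures are illustrative assumptions.

# Rough roofline-style estimate (illustrative assumptions, not numbers from the paper):
# compare the arithmetic intensity of a dense projection vs. an element-wise op.

B, S, H = 8, 512, 1024        # batch size, sequence length, hidden size (assumed)
BYTES = 2                     # bytes per value under fp16 mixed precision

# Dense projection Y = X @ W with X of shape (B*S, H) and W of shape (H, H)
mm_flops = 2 * B * S * H * H
mm_bytes = BYTES * (B * S * H + H * H + B * S * H)    # read X, read W, write Y

# Bias + GeLU applied element-wise to the (B*S, H) projection output
ew_flops = 10 * B * S * H                             # a handful of flops per element (rough)
ew_bytes = BYTES * (2 * B * S * H + H)                # read Y, read bias, write result

ridge = 125e12 / 900e9        # assumed V100 ridge point: peak fp16 flop/s over HBM2 bytes/s

for name, flops, moved in [("matmul", mm_flops, mm_bytes), ("bias+GeLU", ew_flops, ew_bytes)]:
    ai = flops / moved
    verdict = "compute-bound" if ai > ridge else "memory-bound"
    print(f"{name:10s} ~{ai:6.1f} flop/byte  ->  {verdict}")

On these assumed numbers the projection lands well above the ridge point while the element-wise operator lands far below it, which is the intuition behind fusing memory-bound operators and choosing data layouts that avoid redundant reads and writes.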
