GACT: Activation Compressed Training for General Architectures

Training large neural network (NN) models requires extensive memory resources, and Activation Compressed Training (ACT) is a promising approach to reduce the training memory footprint. This paper presents GACT, an ACT framework that supports a broad range of machine learning tasks and generic NN architectures with limited domain knowledge. By analyzing a linearized version of ACT's approximate gradient, we prove the convergence of GACT without prior knowledge of operator type or model architecture. To make training stable, we propose an algorithm that decides the compression ratio for each tensor by estimating its impact on the gradient at run time. We implement GACT as a PyTorch library that readily applies to any NN architecture. GACT reduces the activation memory of convolutional NNs, transformers, and graph NNs by up to 8.1×, enabling training with a 4.2× to 24.7× larger batch size, with negligible accuracy loss.

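The abstract describes the mechanism only at a high level. The sketch below illustrates the basic ACT idea it builds on: compress each activation in the forward pass, then decompress it when the weight gradient is computed in the backward pass. This is a minimal, hypothetical illustration, not GACT's actual API: the names CompressedLinear, _quantize, and _dequantize are invented here, and the fixed 8-bit min-max quantizer stands in for GACT's run-time, per-tensor compression-ratio selection.

import torch

def _quantize(x, bits=8):
    # Per-tensor min-max quantization of an activation to unsigned integers.
    lo, hi = x.min(), x.max()
    scale = (hi - lo) / (2 ** bits - 1) + 1e-12
    q = torch.round((x - lo) / scale).to(torch.uint8)
    return q, scale, lo

def _dequantize(q, scale, lo):
    # Recover an approximate fp32 activation from its quantized form.
    return q.to(torch.float32) * scale + lo

class CompressedLinear(torch.autograd.Function):
    # A linear layer that saves a compressed copy of its input for the
    # backward pass instead of the full-precision activation.
    @staticmethod
    def forward(ctx, x, weight):
        q, scale, lo = _quantize(x)                 # compress before saving
        ctx.save_for_backward(q, scale, lo, weight)
        return x @ weight.t()

    @staticmethod
    def backward(ctx, grad_out):
        q, scale, lo, weight = ctx.saved_tensors
        x_hat = _dequantize(q, scale, lo)           # approximate activation
        grad_x = grad_out @ weight                  # gradient w.r.t. input
        grad_w = grad_out.t() @ x_hat               # approximate weight gradient
        return grad_x, grad_w

Calling y = CompressedLinear.apply(x, weight) trades activation memory (uint8 instead of fp32 saved for backward) for an approximate weight gradient; bounding the effect of that approximation is what the paper's convergence analysis addresses.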