Sparse Bitmap Compression for Memory-Efficient Training on the Edge

Training on the Edge enables neural networks to learn continuously from new data after deployment on memory-constrained edge devices. Previous work is mostly concerned with reducing the number of model parameters, which is only beneficial for inference. However, the memory footprint of activations is the main bottleneck for training on the edge. Existing incremental training methods fine-tune only the last few layers, sacrificing the accuracy gains from re-training the whole model. In this work, we investigate the memory footprint of training deep learning models and use our observations to propose BitTrain. In BitTrain, we exploit activation sparsity and propose a novel bitmap compression technique that reduces the memory footprint during training. We save the activations in our proposed bitmap compression format during the forward pass of training and restore them during the backward pass for the optimizer computations. The proposed method can be integrated seamlessly into the computation graph of modern deep learning frameworks. Our implementation is safe by construction and has no negative impact on the accuracy of model training. Experimental results show up to 34% reduction in the memory footprint at a sparsity level of 50%. Further pruning during training results in more than 70% sparsity, which can lead to up to a 56% reduction in memory footprint. BitTrain advances the efforts towards bringing more machine learning capabilities to edge devices. Our source code is available at https://github.com/scale-lab/BitTrain.
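
To make the compression idea concrete, below is a minimal sketch in PyTorch/NumPy of how a sparse activation tensor could be stored as a packed 1-bit-per-element occupancy bitmap plus its nonzero values, and later restored before the backward pass. The helper names bitmap_compress and bitmap_decompress are illustrative assumptions, not the released BitTrain API, and the integration into the autograd graph (e.g., a custom autograd function that compresses in the forward pass and decompresses in the backward pass) is omitted for brevity.

```python
import numpy as np
import torch

def bitmap_compress(activation: torch.Tensor):
    """Keep only the nonzero activation values plus a packed 1-bit-per-element
    occupancy bitmap, instead of the full dense tensor (illustrative sketch)."""
    flat = activation.detach().flatten()
    mask = flat != 0
    values = flat[mask].clone()               # dense vector of nonzero activations
    bitmap = np.packbits(mask.cpu().numpy())  # ~numel/8 bytes of occupancy bits
    return bitmap, values, activation.shape, activation.dtype

def bitmap_decompress(bitmap, values, shape, dtype):
    """Rebuild the dense activation tensor from the bitmap and nonzero values."""
    numel = int(np.prod(shape))
    mask = np.unpackbits(bitmap)[:numel].astype(bool)  # unpackbits pads to a multiple of 8
    dense = torch.zeros(numel, dtype=dtype)
    dense[torch.from_numpy(mask)] = values
    return dense.reshape(shape)

# Example: ReLU outputs are often ~50% sparse, so the compressed form costs
# roughly numel/8 bytes for the bitmap plus one value per nonzero element.
act = torch.relu(torch.randn(2, 64, 16, 16))
bitmap, values, shape, dtype = bitmap_compress(act)
assert torch.equal(bitmap_decompress(bitmap, values, shape, dtype), act)
```

At 50% sparsity this layout stores about half the values plus a small bitmap overhead, which is consistent with the ~34% footprint reduction reported above; higher sparsity from pruning shrinks the value array further.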
