SmartDeal: Re-Modeling Deep Network Weights for Efficient Inference and Training

The record-breaking performance of deep neural networks (DNNs) comes with heavy parameter budgets, which typically require off-chip dynamic random-access memory (DRAM) for storage. The prohibitive energy cost of DRAM accesses makes DNN deployment on resource-constrained devices non-trivial, calling for minimizing the movement of weights and data to improve energy efficiency. Driven by this critical bottleneck, we present SmartDeal, a hardware-friendly algorithm framework that trades higher-cost memory storage/access for lower-cost computation, in order to aggressively boost the storage and energy efficiency of both DNN inference and training. The core technique of SmartDeal is a novel DNN weight matrix decomposition framework with tailored structural constraints on each matrix factor, carefully crafted to unleash the hardware-aware efficiency potential. Specifically, we decompose each weight tensor as the product of a small basis matrix and a large, structurally sparse coefficient matrix whose non-zero elements are quantized to powers of two. The resulting sparse and readily quantized DNNs enjoy greatly reduced energy consumption in data movement as well as weight storage, while incurring minimal overhead to recover the original weights, since doing so requires only sparse bit operations and other low-cost computations. Beyond inference, we take another leap to embrace energy-efficient training by introducing several innovative techniques to address the unique roadblocks that arise in training while preserving the SmartDeal structures. We also design a dedicated hardware accelerator that fully utilizes the new weight structure to improve real energy efficiency and latency. We conduct experiments on both vision and language tasks, with nine models, four datasets, and three settings (inference-only, adaptation, and fine-tuning). Our extensive results show that: 1) applied to inference, SmartDeal achieves up to 2.44× improvement in energy efficiency as evaluated via real hardware implementations; 2) applied to training, SmartDeal leads to 10.56× and 4.48× reductions in storage and training energy cost, respectively, with usually negligible accuracy loss, compared to state-of-the-art training baselines. Our source code is available at: https://github.com/VITA-Group/SmartDeal.
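To make the core weight re-modeling concrete, below is a minimal NumPy sketch of the kind of decomposition described above: a weight matrix W is approximated as the product of a sparse coefficient matrix with power-of-two non-zeros and a small dense basis matrix. The function names (smart_decompose, power_of_two_round) and the specific choices of basis rank, sparsity level, exponent range, and alternating refinement are illustrative assumptions, not the exact SmartDeal algorithm or its hardware-oriented structural constraints.

```python
# Illustrative sketch only: approximate W (n x m) as C @ B, where B is a small
# basis matrix and C is a sparse coefficient matrix whose non-zero entries are
# rounded to signed powers of two. Rank r, sparsity level, exponent range, and
# the alternating refinement loop are assumed hyperparameters for illustration.
import numpy as np

def power_of_two_round(x, min_exp=-8, max_exp=0):
    """Round the magnitude of each entry to the nearest power of two, keeping its sign."""
    sign = np.sign(x)
    mag = np.abs(x)
    exp = np.clip(np.round(np.log2(np.maximum(mag, 2.0 ** min_exp))), min_exp, max_exp)
    return sign * 2.0 ** exp

def smart_decompose(W, r=8, sparsity=0.5, iters=3):
    """Factor W into a sparse, power-of-two coefficient matrix C and a small basis B."""
    # Initialize from a truncated SVD: B (r x m) is the small basis, C (n x r) the coefficients.
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    B = np.diag(S[:r]) @ Vt[:r]
    C = U[:, :r].copy()
    for _ in range(iters):
        # Least-squares fit of C given B, then enforce the structural constraints on C.
        C = W @ np.linalg.pinv(B)
        thresh = np.quantile(np.abs(C), sparsity)
        C[np.abs(C) < thresh] = 0.0                  # structural sparsity
        C = power_of_two_round(C) * (C != 0)         # power-of-two non-zeros
        # Refit the small, unconstrained basis given the fixed C.
        B = np.linalg.pinv(C) @ W
    return C, B

# Usage: C @ B substitutes for W; storing the sparse, power-of-two C plus the
# small B is cheaper than storing the dense W itself.
W = np.random.randn(256, 64).astype(np.float32)
C, B = smart_decompose(W)
rel_err = np.linalg.norm(W - C @ B) / np.linalg.norm(W)
print(f"relative reconstruction error: {rel_err:.3f}")
```

In this sketch, recovering the weights amounts to multiplying by power-of-two coefficients, which maps to shift operations in hardware; this is the sense in which the decomposition trades memory storage/access for low-cost computation.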
