ESCALATE: Boosting the Efficiency of Sparse CNN Accelerator with Kernel Decomposition

The ever-growing parameter size and computation cost of Convolutional Neural Network (CNN) models hinder their deployment onto resource-constrained platforms. Network pruning techniques remove the redundancy in CNN parameters and produce a sparse model, and sparsity-aware accelerators leverage that sparsity to reduce the computation cost and memory bandwidth requirements of inference. The irregularity of sparse patterns, however, limits the efficiency of these designs. Prior work addresses this issue by enforcing regular sparsity patterns through hardware-aware pruning algorithms, but the achievable pruning rate is then largely limited by the enforced patterns. This limitation motivates us to explore compression methods beyond pruning. We observe that kernel decomposition, with its two decoupled computation stages, can take the processing of sparse patterns off the critical path of inference and achieve a high compression ratio without enforcing any sparsity pattern. To exploit these advantages, we propose ESCALATE, an algorithm-hardware co-design approach based on kernel decomposition. At the algorithm level, ESCALATE reorganizes the two computation stages of the decomposed convolution to enable stream processing of the intermediate feature map, and applies a hybrid quantization that exploits the different reuse frequencies of the two parts of the decomposed weights. At the architecture level, ESCALATE introduces a novel ‘Basis-First’ dataflow and a corresponding microarchitecture to maximize the benefits of the decomposed convolution. We evaluate ESCALATE with four representative CNN models on the CIFAR-10 and ImageNet datasets and compare it against previous sparse accelerators and pruning algorithms. ESCALATE achieves compression ratios of up to 325× on CIFAR-10 and 11× on ImageNet. Compared with previous dense and sparse accelerators, the ESCALATE accelerator improves energy efficiency by 8.3× and 3.77× on average, and reduces latency by 17.9× and 2.16×, respectively.
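To make the kernel-decomposition idea concrete, the sketch below illustrates the two decoupled computation stages the abstract refers to. It is a minimal PyTorch example with hypothetical shapes and variable names, not the authors' implementation: each KxK kernel is expressed as a linear combination of a small shared basis, so the convolution splits into a "basis" stage that produces an intermediate feature map and a 1x1 "coefficient" stage that recombines it. Compression comes from storing only the basis kernels and the (typically sparse) coefficients instead of the full dense weights.

import torch
import torch.nn.functional as F

# Hypothetical shapes: Cout output channels, Cin input channels,
# KxK kernels, and a small shared basis of size B (B << Cout).
Cout, Cin, K, B = 64, 32, 3, 8
x     = torch.randn(1, Cin, 56, 56)   # input feature map
basis = torch.randn(B, K, K)          # shared basis kernels
coeff = torch.randn(Cout, Cin, B)     # per-kernel combination coefficients

# Dense reference: every kernel is a linear combination of the basis.
w_dense = torch.einsum('oib,bkh->oikh', coeff, basis)   # (Cout, Cin, K, K)
y_ref   = F.conv2d(x, w_dense, padding=K // 2)

# Stage 1 ("basis" stage): convolve each input channel with every basis
# kernel, implemented as a grouped conv mapping Cin -> Cin*B channels.
w_basis = basis.unsqueeze(1).repeat(Cin, 1, 1, 1)           # (Cin*B, 1, K, K)
inter   = F.conv2d(x, w_basis, padding=K // 2, groups=Cin)  # intermediate map

# Stage 2 ("coefficient" stage): a 1x1 convolution with the coefficients
# recombines the intermediate channels into the Cout outputs.
w_coeff = coeff.reshape(Cout, Cin * B, 1, 1)
y_dec   = F.conv2d(inter, w_coeff)

print(torch.allclose(y_ref, y_dec, atol=1e-3))  # True: both paths agree

Because the basis is dense and reused for every kernel while the coefficients carry the per-kernel information, the two parts naturally tolerate different treatment (e.g., different quantization precision), which is the intuition behind the hybrid quantization mentioned in the abstract.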
