DARB: A Density-Adaptive Regular-Block Pruning for Deep Neural Networks

The rapidly growing parameter volume of deep neural networks (DNNs) hinders the deployment of artificial intelligence applications on resource-constrained devices, such as mobile and wearable devices. Neural network pruning, one of the mainstream model compression techniques, is under extensive study as a way to reduce model size and, in turn, the amount of computation, so that state-of-the-art DNNs can be deployed on such devices with high runtime energy efficiency. In contrast to irregular pruning, which incurs high index storage and decoding overhead, structured pruning techniques have been proposed as promising solutions. However, prior studies on structured pruning tackle the problem mainly from the perspective of facilitating hardware implementation, without analyzing the characteristics of sparse neural networks in depth. This neglect leads to an inefficient trade-off between regularity and pruning ratio, so the potential of structurally pruning neural networks is not fully exploited. In this work, we examine the structural characteristics of irregularly pruned weight matrices, such as the diverse redundancy of different rows, the sensitivity of different rows to pruning, and the positional characteristics of retained weights. Guided by these insights, we first propose the novel block-max weight masking (BMWM) method, which effectively retains the salient weights while imposing high regularity on the weight matrix. As a further optimization, we propose density-adaptive regular-block (DARB) pruning, which exploits the intrinsic characteristics of neural networks and thereby outperforms prior structured pruning work in both pruning ratio and decoding efficiency. Our experimental results show that DARB achieves 13× to 25× pruning ratios, a 2.8× to 4.3× improvement over state-of-the-art counterparts on multiple neural network models and tasks. Moreover, DARB achieves 14.3× higher decoding efficiency than block pruning while attaining a higher pruning ratio.
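
The abstract names block-max weight masking only at a high level. The snippet below is a minimal NumPy sketch of one plausible reading, in which each row of a weight matrix is split into fixed-size blocks and only the largest-magnitude weight in each block survives; the block size and the one-survivor-per-block rule are illustrative assumptions, not the paper's exact formulation.

```python
# Hedged sketch of block-max weight masking (BMWM): keep the single
# largest-magnitude weight in every fixed-size block of each row.
# Block size and "one survivor per block" are assumptions for illustration.
import numpy as np

def block_max_weight_mask(weights: np.ndarray, block_size: int = 4) -> np.ndarray:
    """Return a boolean mask keeping the max-magnitude weight per block of each row."""
    rows, cols = weights.shape
    assert cols % block_size == 0, "assume columns divide evenly into blocks"
    mask = np.zeros_like(weights, dtype=bool)
    # Group each row into (num_blocks, block_size) and locate the per-block maximum.
    blocks = np.abs(weights).reshape(rows, cols // block_size, block_size)
    max_idx = blocks.argmax(axis=2)                      # salient position within each block
    row_idx = np.arange(rows)[:, None]                   # broadcast over blocks
    col_idx = np.arange(cols // block_size)[None, :] * block_size + max_idx
    mask[row_idx, col_idx] = True
    return mask

# Usage: prune a toy 4x8 weight matrix, retaining one weight per block of 4.
w = np.random.randn(4, 8).astype(np.float32)
pruned = w * block_max_weight_mask(w, block_size=4)
```

Because exactly one weight survives per block, the decoder only needs a short per-block index rather than a full coordinate list, which is the regularity advantage the abstract attributes to BMWM over irregular pruning; a density-adaptive variant would additionally vary the block size per row according to that row's redundancy.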
