BitPruning: Learning Bitlengths for Aggressive and Accurate Quantization

Neural networks have demonstrably achieved state-of-the-art accuracy using low-bitlength integer quantization, yielding both execution-time and energy benefits on existing hardware designs that support short bitlengths. However, the question of finding the minimum bitlength for a desired accuracy remains open. We introduce a training method for minimizing inference bitlength, at any granularity, while maintaining accuracy. Specifically, we propose a regularizer that penalizes large-bitlength representations throughout the architecture and show how it can be modified to minimize other quantifiable criteria, such as the number of operations or the memory footprint. We demonstrate that our method learns thrifty representations while maintaining accuracy. On ImageNet, the method produces average per-layer bitlengths of 4.13, 3.76 and 4.36 bits on AlexNet, ResNet18 and MobileNet V2 respectively, while remaining within 2.0%, 0.5% and 0.5% of the baseline TOP-1 accuracy.
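To make the idea concrete, the following is a minimal, hypothetical PyTorch sketch (not the authors' code) of a per-layer learnable bitlength combined with a straight-through estimator for rounding and a regularizer that sums the learned bitlengths. The names (LearnedBitlengthQuantizer, bitlength_penalty) and the specific quantizer details are assumptions introduced here for illustration only.

    # Minimal illustrative sketch, assuming a simple uniform quantizer with a
    # learnable (continuous) bitlength per layer; not the paper's implementation.
    import torch
    import torch.nn as nn

    class LearnedBitlengthQuantizer(nn.Module):
        def __init__(self, init_bits: float = 8.0):
            super().__init__()
            # Continuous surrogate for the integer bitlength; ceiled in forward.
            self.bits = nn.Parameter(torch.tensor(init_bits))

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # Straight-through ceil: forward uses ceil(bits), gradient flows to bits.
            b = self.bits + (torch.ceil(self.bits) - self.bits).detach()
            levels = 2.0 ** b - 1.0
            scale = x.abs().max().clamp(min=1e-8) / levels
            q = torch.round(x / scale)
            # Straight-through estimator: identity gradient through the rounding.
            q = (q - x / scale).detach() + x / scale
            return q * scale

    def bitlength_penalty(model: nn.Module) -> torch.Tensor:
        # Sum of learned bitlengths over all quantizers in the model; this is the
        # regularization term that drives bitlengths down during training.
        return sum(m.bits for m in model.modules()
                   if isinstance(m, LearnedBitlengthQuantizer))

Under these assumptions, training would minimize task_loss + lam * bitlength_penalty(model), where the scalar lam trades accuracy against bitlength; weighting each quantizer's term by, for example, its layer's operation count or memory footprint would steer the optimization toward those criteria instead.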
