Term Revealing: Furthering Quantization at Run Time on Quantized DNNs

We present a novel technique, called Term Revealing (TR), for furthering quantization at run time to improve the performance of Deep Neural Networks (DNNs) already quantized with conventional quantization methods. TR operates on the power-of-two terms in the binary expressions of values. When computing a dot product, TR dynamically selects a fixed number of the largest terms to use from the values of the two vectors. By exploiting the normal-like weight and data distributions typically present in DNNs, TR has a minimal impact on DNN model performance (i.e., accuracy or perplexity). We use TR to facilitate tightly synchronized processor arrays, such as systolic arrays, for efficient parallel processing. We show an FPGA implementation that can use a small number of control bits to switch between conventional quantization and TR-enabled quantization with negligible delay. To further enhance TR efficiency, we use a signed digit representation (SDR), as opposed to the classic binary encoding, which has only nonnegative power-of-two terms. To convert from binary to SDR, we develop an efficient encoding method called HESE (Hybrid Encoding for Signed Expressions) that can be performed in a single pass, looking at only two bits at a time. We evaluate TR with HESE-encoded values on an MLP for MNIST, multiple CNNs for ImageNet, and an LSTM for Wikitext-2, and show significant reductions in inference computation (3-10x) compared to conventional quantization at the same level of model performance.
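To make the term-selection idea concrete, the sketch below (Python, ours, not the paper's implementation) decomposes quantized integer operands into signed power-of-two terms, keeps only a fixed budget of the largest-magnitude terms, and then forms the dot product. A standard non-adjacent-form recoding stands in for HESE, and the term budget is applied per value for simplicity; the function names (`hese_like_terms`, `reveal`, `tr_dot`) and the per-value budget are our own illustrative choices.

```python
def hese_like_terms(x):
    """Decompose an integer into signed power-of-two terms.

    A standard non-adjacent-form (NAF) recoding is used here as a
    stand-in for the paper's HESE encoding; like HESE, it allows
    negative terms and so typically needs fewer terms than plain binary.
    """
    terms = []
    shift = 0
    while x != 0:
        if x & 1:                 # current bit set: emit a signed term
            digit = 2 - (x & 3)   # +1 if x % 4 == 1, -1 if x % 4 == 3
            terms.append(digit * (1 << shift))
            x -= digit
        x >>= 1
        shift += 1
    return terms


def reveal(x, budget):
    """Term Revealing: keep only the `budget` largest-magnitude terms of x."""
    terms = sorted(hese_like_terms(x), key=abs, reverse=True)
    return sum(terms[:budget])


def tr_dot(xs, ws, budget):
    """Dot product with every operand truncated to `budget` signed terms."""
    return sum(reveal(x, budget) * reveal(w, budget) for x, w in zip(xs, ws))


if __name__ == "__main__":
    xs, ws = [23, -9, 14, 3], [5, 7, -3, 12]
    print(tr_dot(xs, ws, budget=2))               # truncated dot product
    print(sum(x * w for x, w in zip(xs, ws)))     # exact dot product, for comparison
```

In this toy example only the value 23 needs more than two signed terms, so the truncated result stays close to the exact one; the paper's point is that with normal-like DNN weight and activation distributions such truncation rarely changes the outcome of inference.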
