PrecisionBatching: Bitserial Decomposition for Efficient Neural Network Inference on GPUs

We present PrecisionBatching, a quantized inference algorithm for speeding up neural network inference at low bitwidths on traditional hardware platforms. PrecisionBatching is based on the following insights: 1) neural network inference with low batch sizes on traditional hardware architectures (e.g., GPUs) is memory bound, 2) activation precision is critical to quantized model quality, and 3) matrix-vector multiplication can be decomposed into binary matrix-matrix multiplications, enabling quantized inference with higher-precision activations at the cost of more arithmetic operations. Combining these three insights, PrecisionBatching enables inference at extreme quantization levels (< 8 bits) by shifting a memory-bound problem to a compute-bound one, and achieves higher compute efficiency and runtime speedup at fixed accuracy thresholds than standard quantized inference methods. Across a variety of applications (MNIST, language modeling, natural language inference, reinforcement learning) and neural network architectures (fully connected, RNN, LSTM), PrecisionBatching yields end-to-end GPU speedups of over 8× within a 1-5% error margin of the full-precision baseline, outperforming traditional 8-bit quantized inference by over 1.5-2× at the same error tolerance.
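To make insight 3 concrete, the following is a minimal NumPy sketch of a bitserial decomposition: the activation vector is quantized to fixed point, split into bit planes, and the matrix-vector product is recovered from one binary matrix-matrix product whose columns are recombined with powers of two. The function name bitserial_matvec and the unsigned affine quantizer are illustrative assumptions for exposition, not the paper's GPU kernel.

import numpy as np

def bitserial_matvec(W, x, n_bits=8):
    # Quantize x to n_bits of unsigned fixed point: x ~= x_min + scale * q.
    x_min, x_max = float(x.min()), float(x.max())
    scale = (x_max - x_min) / (2 ** n_bits - 1)
    q = np.round((x - x_min) / scale).astype(np.int64)

    # Extract bit planes: column b holds bit b of every activation (a binary matrix).
    bit_planes = ((q[:, None] >> np.arange(n_bits)) & 1).astype(W.dtype)  # shape (n, n_bits)

    # One binary matrix-matrix product replaces n_bits separate matrix-vector products.
    partials = W @ bit_planes  # shape (m, n_bits)

    # Recombine partial products with powers of two, then undo the affine quantization.
    return scale * (partials @ (2.0 ** np.arange(n_bits))) + x_min * W.sum(axis=1)

# Quick check against the full-precision product.
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 128)).astype(np.float32)
x = rng.standard_normal(128).astype(np.float32)
print(np.max(np.abs(bitserial_matvec(W, x) - W @ x)))  # small quantization error

The extra arithmetic (n_bits binary columns instead of one full-precision vector) is what shifts the workload from memory bound to compute bound: the weight matrix is read once but reused across all bit planes.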
