APNN-TC: Accelerating Arbitrary Precision Neural Networks on Ampere GPU Tensor Cores

Over the years, accelerating neural networks with quantization has been widely studied. Unfortunately, prior efforts with diverse precisions (e.g., 1-bit weights and 2-bit activations) are usually restricted by the limited precision support on GPUs (e.g., int1 and int4). To break such restrictions, we introduce the first Arbitrary Precision Neural Network framework (APNN-TC) to fully exploit quantization benefits on Ampere GPU Tensor Cores. Specifically, APNN-TC first incorporates a novel emulation algorithm to support arbitrary short-bit-width computation with int1 compute primitives and XOR/AND Boolean operations. Second, APNN-TC integrates arbitrary-precision layer designs that efficiently map this emulation algorithm to Tensor Cores with novel batching strategies and specialized memory organization. Third, APNN-TC embodies a novel arbitrary-precision NN design that minimizes memory access across layers to further improve performance. Extensive evaluations show that APNN-TC achieves significant speedup over CUTLASS kernels and across various NN models, such as ResNet and VGG.
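
The emulation algorithm rests on a simple arithmetic identity: an arbitrary-bit dot product can be decomposed into 1-bit dot products over bit planes, each computable with a bitwise AND plus a population count, and recombined by shift-and-add. The sketch below illustrates that identity on the CPU for unsigned quantized operands; the names (X_BITS, W_BITS, dot_emulated) and the scalar popcount loop are illustrative assumptions, not the APNN-TC kernels, which map the same computation onto int1 Tensor Core primitives.

```cpp
// Minimal CPU sketch of bit-serial emulation of an arbitrary-precision
// dot product using only 1-bit operations (AND + popcount).
// Assumes unsigned quantized operands; all names are illustrative and
// not taken from the APNN-TC source.
#include <bitset>
#include <cstdint>
#include <cstdio>
#include <vector>

constexpr int X_BITS = 2;   // example activation bit-width
constexpr int W_BITS = 3;   // example weight bit-width

// Extract bit plane `b` of a vector of small unsigned integers,
// packing up to 64 lanes into one 64-bit word for simplicity.
uint64_t bit_plane(const std::vector<uint8_t>& v, int b) {
    uint64_t plane = 0;
    for (size_t i = 0; i < v.size(); ++i)
        plane |= uint64_t((v[i] >> b) & 1u) << i;
    return plane;
}

// Emulated dot product: sum over bit-plane pairs of
// 2^(i+j) * popcount(x_plane_i AND w_plane_j).
uint64_t dot_emulated(const std::vector<uint8_t>& x,
                      const std::vector<uint8_t>& w) {
    uint64_t acc = 0;
    for (int i = 0; i < X_BITS; ++i) {
        uint64_t xp = bit_plane(x, i);
        for (int j = 0; j < W_BITS; ++j) {
            uint64_t wp = bit_plane(w, j);
            acc += uint64_t(std::bitset<64>(xp & wp).count()) << (i + j);
        }
    }
    return acc;
}

int main() {
    std::vector<uint8_t> x = {1, 3, 0, 2};   // 2-bit activations
    std::vector<uint8_t> w = {5, 1, 7, 2};   // 3-bit weights
    uint64_t ref = 0;
    for (size_t i = 0; i < x.size(); ++i) ref += uint64_t(x[i]) * w[i];
    std::printf("emulated = %llu, reference = %llu\n",
                (unsigned long long)dot_emulated(x, w),
                (unsigned long long)ref);
    return 0;
}
```

The same identity extends to matrix multiplication: each pair of bit planes yields one 1-bit GEMM, so a p-bit-by-q-bit layer costs p x q binary matrix multiplications plus cheap shift-and-add recombination, which is the workload the arbitrary-precision layer designs batch onto Tensor Cores.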
