APNN-TC: Accelerating Arbitrary Precision Neural Networks on Ampere GPU Tensor Cores

Over the years, accelerating neural networks with quantization has been widely studied. Unfortunately, prior efforts with diverse precisions (e.g., 1-bit weights and 2-bit activations) are usually restricted by the limited precision support on GPUs (e.g., int1 and int4). To break such restrictions, we introduce the first Arbitrary Precision Neural Network framework (APNN-TC) to fully exploit quantization benefits on Ampere GPU Tensor Cores. Specifically, APNN-TC first incorporates a novel emulation algorithm to support arbitrary short-bit-width computation with int1 compute primitives and XOR/AND Boolean operations. Second, APNN-TC integrates arbitrary-precision layer designs that efficiently map this emulation algorithm to Tensor Cores with novel batching strategies and specialized memory organization. Third, APNN-TC embodies a novel arbitrary-precision NN design that minimizes memory access across layers to further improve performance. Extensive evaluations show that APNN-TC achieves significant speedup over CUTLASS kernels and across various NN models, such as ResNet and VGG.
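
The emulation algorithm rests on a simple arithmetic identity: an arbitrary-bit dot product can be decomposed into 1-bit dot products over bit planes, each computable with a bitwise AND plus a population count, and recombined by shift-and-add. The sketch below illustrates that identity on the CPU for unsigned quantized operands; the names (X_BITS, W_BITS, dot_emulated) and the scalar popcount loop are illustrative assumptions, not the APNN-TC kernels, which map the same computation onto int1 Tensor Core primitives.

```cpp
// Minimal CPU sketch of bit-serial emulation of an arbitrary-precision
// dot product using only 1-bit operations (AND + popcount).
// Assumes unsigned quantized operands; all names are illustrative and
// not taken from the APNN-TC source.
#include <bitset>
#include <cstdint>
#include <cstdio>
#include <vector>

constexpr int X_BITS = 2;   // example activation bit-width
constexpr int W_BITS = 3;   // example weight bit-width

// Extract bit plane `b` of a vector of small unsigned integers,
// packing up to 64 lanes into one 64-bit word for simplicity.
uint64_t bit_plane(const std::vector<uint8_t>& v, int b) {
    uint64_t plane = 0;
    for (size_t i = 0; i < v.size(); ++i)
        plane |= uint64_t((v[i] >> b) & 1u) << i;
    return plane;
}

// Emulated dot product: sum over bit-plane pairs of
// 2^(i+j) * popcount(x_plane_i AND w_plane_j).
uint64_t dot_emulated(const std::vector<uint8_t>& x,
                      const std::vector<uint8_t>& w) {
    uint64_t acc = 0;
    for (int i = 0; i < X_BITS; ++i) {
        uint64_t xp = bit_plane(x, i);
        for (int j = 0; j < W_BITS; ++j) {
            uint64_t wp = bit_plane(w, j);
            acc += uint64_t(std::bitset<64>(xp & wp).count()) << (i + j);
        }
    }
    return acc;
}

int main() {
    std::vector<uint8_t> x = {1, 3, 0, 2};   // 2-bit activations
    std::vector<uint8_t> w = {5, 1, 7, 2};   // 3-bit weights
    uint64_t ref = 0;
    for (size_t i = 0; i < x.size(); ++i) ref += uint64_t(x[i]) * w[i];
    std::printf("emulated = %llu, reference = %llu\n",
                (unsigned long long)dot_emulated(x, w),
                (unsigned long long)ref);
    return 0;
}
```

The same identity extends to matrix multiplication: each pair of bit planes yields one 1-bit GEMM, so a p-bit-by-q-bit layer costs p x q binary matrix multiplications plus cheap shift-and-add recombination, which is the workload the arbitrary-precision layer designs batch onto Tensor Cores.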
