Automatic generation of high-performance quantized machine learning kernels

Quantization optimizes machine learning inference for resource-constrained environments by reducing the precision of its computations. In the extreme, even single-bit computations can produce acceptable results at dramatically lower cost. But this ultra-low-precision quantization is difficult to exploit because extracting optimal performance requires hand-tuning both high-level scheduling decisions and low-level implementations. As a result, practitioners settle for a few predefined quantized kernels, sacrificing optimality and restricting their ability to adapt to new hardware. This paper presents a new automated approach to implementing quantized inference for machine learning models. We integrate the choice of how to lay out quantized values into the scheduling phase of a machine learning compiler, allowing it to be optimized in concert with tiling and parallelization decisions. After scheduling, we use program synthesis to automatically generate efficient low-level operator implementations for the desired precision and data layout. We scale up synthesis using a novel reduction sketch that exploits the structure of matrix multiplication. On a ResNet18 model, our generated code outperforms an optimized floating-point baseline by up to 3.9×, and a state-of-the-art quantized implementation by up to 16.6×.
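
As a concrete illustration of the single-bit case, the sketch below (hypothetical code, not the paper's generated kernels; the name binary_dot is ours) shows how a dot product over ±1 values packed 64 per machine word reduces to XNOR and popcount, which is why ultra-low-precision kernels can be so cheap:

    #include <stdint.h>

    /* Hypothetical 1-bit dot product in the style of binarized networks.
     * Each bit encodes a +1/-1 value; a[i] and w[i] each hold 64 lanes.
     * Uses the GCC/Clang builtin __builtin_popcountll. */
    int binary_dot(const uint64_t *a, const uint64_t *w, int n_words) {
        int matches = 0;
        for (int i = 0; i < n_words; i++) {
            /* XNOR marks lanes where the signs agree (+1*+1 or -1*-1). */
            uint64_t agree = ~(a[i] ^ w[i]);
            matches += __builtin_popcountll(agree);
        }
        /* Signed sum = (#agreeing lanes) - (#disagreeing lanes). */
        return 2 * matches - 64 * n_words;
    }

Replacing each multiply-accumulate with a bitwise operation and a population count is exactly the kind of low-level implementation choice that, in this work, is left to the synthesizer rather than hand-written for each precision and layout.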
