High Throughput Matrix-Matrix Multiplication between Asymmetric Bit-Width Operands

Matrix multiplications between asymmetric bit-width operands, especially between 8- and 4-bit operands, are likely to become a fundamental kernel of many important workloads, including neural networks and machine learning. Existing SIMD matrix multiplication instructions for symmetric bit-width operands can support mixed-precision operands by zero- or sign-extending the narrower operand to match the size of the other, but they cannot exploit the benefit of the narrower bit-width of one of the operands. We propose a new SIMD matrix multiplication instruction that uses mixed precision on its inputs (8- and 4-bit operands) and accumulates product values into narrower 16-bit output accumulators. This allows a SIMD operation at 128-bit vector width to process a greater number of data elements per instruction, improving processing throughput and memory bandwidth utilization without increasing the register read- and write-port bandwidth in CPUs. The proposed asymmetric-operand-size SIMD instruction offers a 2x improvement in matrix multiplication throughput over existing symmetric-operand-size instructions, while causing negligible (0.05%) overflow from the 16-bit accumulators for representative machine learning workloads. Beyond CPUs, the asymmetric-operand-size instruction can also support multiply-and-accumulate (MAC) operations between 8- and 4-bit operands in state-of-the-art DNN hardware accelerators (e.g., the systolic-array microarchitecture in the Google TPU) and offer a similar improvement in matrix multiplication performance without violating their implementation constraints. We demonstrate how a systolic array architecture designed for symmetric-operand-size instructions can be modified to support an asymmetric-operand-size instruction.
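
To make the instruction's semantics concrete, the following C sketch models one 128-bit operation as a scalar reference: eight 16-bit accumulators each gather a short dot product of signed 8-bit elements against packed signed 4-bit elements. The lane geometry (four products per accumulator), the nibble packing order, and the saturating overflow policy are illustrative assumptions for this sketch, not the exact encoding proposed in the paper.

#include <stdint.h>
#include <stdio.h>

#define ELEMS_PER_ACC 4   /* assumed int8 x int4 products folded into each lane   */
#define NUM_ACCS      8   /* eight 16-bit accumulators fill one 128-bit register  */

/* Sign-extend a 4-bit value held in the low nibble of a byte. */
static inline int8_t sext4(uint8_t nibble)
{
    int8_t v = (int8_t)(nibble & 0x0F);
    return (v & 0x08) ? (int8_t)(v - 16) : v;
}

/* One "instruction": for each lane i,
 *   acc[i] += dot(a[i*4 .. i*4+3], b[i*4 .. i*4+3])
 * where a holds signed 8-bit elements and b_packed holds signed 4-bit
 * elements, two per byte (low nibble first). */
static void asym_mac_128(int16_t acc[NUM_ACCS],
                         const int8_t a[NUM_ACCS * ELEMS_PER_ACC],
                         const uint8_t b_packed[(NUM_ACCS * ELEMS_PER_ACC) / 2])
{
    for (int i = 0; i < NUM_ACCS; ++i) {
        int32_t sum = acc[i];
        for (int j = 0; j < ELEMS_PER_ACC; ++j) {
            int idx = i * ELEMS_PER_ACC + j;
            uint8_t byte = b_packed[idx / 2];
            int8_t  w    = sext4((idx & 1) ? (uint8_t)(byte >> 4) : byte);
            sum += (int32_t)a[idx] * w;
        }
        /* Saturate to 16 bits; saturation is an assumed overflow policy here,
         * motivated by the ~0.05% overflow rate reported for 16-bit
         * accumulators on representative ML workloads. */
        if (sum > INT16_MAX) sum = INT16_MAX;
        if (sum < INT16_MIN) sum = INT16_MIN;
        acc[i] = (int16_t)sum;
    }
}

int main(void)
{
    int16_t acc[NUM_ACCS] = {0};
    int8_t  a[NUM_ACCS * ELEMS_PER_ACC];
    uint8_t b[(NUM_ACCS * ELEMS_PER_ACC) / 2];

    for (int i = 0; i < NUM_ACCS * ELEMS_PER_ACC; ++i) a[i] = (int8_t)(i - 16);
    for (int i = 0; i < (NUM_ACCS * ELEMS_PER_ACC) / 2; ++i) b[i] = 0x7E; /* nibbles -2 and 7 */

    asym_mac_128(acc, a, b);
    for (int i = 0; i < NUM_ACCS; ++i) printf("acc[%d] = %d\n", i, acc[i]);
    return 0;
}

A larger matrix-matrix multiply would be expressed as a sequence of such operations over tiles, much as existing symmetric-operand-size SIMD instructions are used today, but touching twice as many weight elements per 128-bit register.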
