Embracing Diversity: Enhanced DSP Blocks for Low-Precision Deep Learning on FPGAs

The use of reduced precision in Deep Learning (DL) inference has recently been shown to significantly improve accelerator performance and to greatly reduce both model memory footprint and the required external memory bandwidth. With appropriate network retuning, reduced-precision networks can achieve accuracy close or equal to that of full-precision floating-point models. Given the wide spectrum of precisions used in DL inference, the ability of FPGAs to create custom bit-width datapaths gives them an advantage over other acceleration platforms in this domain. However, the embedded DSP blocks in the latest Intel and Xilinx FPGAs do not natively support precisions below 18 bits and therefore cannot efficiently pack low-precision multiplications, leaving the DSP blocks under-utilized. In this work, we present an enhanced DSP block that can efficiently pack 2× as many 9-bit and 4× as many 4-bit multiplications as a baseline Arria-10-like DSP block, at the cost of a 12% block area overhead that translates to only a 0.6% increase in total FPGA core area. We quantify the performance gains of this enhanced DSP block in two state-of-the-art convolutional neural network accelerators on three models: AlexNet, VGG-16, and ResNet-50. On average, the new DSP block improves the computational performance of the 8-bit and 4-bit accelerators by 1.32× and 1.6×, respectively, while reducing the utilized chip area by 15% and 30%, respectively.
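To give a feel for why multiple low-precision products can share one wide multiplier at all, the sketch below demonstrates the standard operand-packing trick in software: two independent unsigned 8-bit multiplications against a shared operand are computed with a single wide multiply, with guard bits keeping the partial products from overlapping. This is only an illustration of the underlying arithmetic, not the paper's DSP block design, which handles signed operands and packs the operands inside the hardened multiplier array; the function name and the unsigned-only simplification are ours.

```python
# Minimal sketch of multiplier packing: compute a*c and b*c with one
# wide multiplication by spacing a and b far enough apart that their
# partial products occupy disjoint bit ranges of the result.

N = 8            # width of each small operand
S = 2 * N        # guard spacing: b*c < 2**(2N), so it cannot spill upward

def packed_mul(a: int, b: int, c: int) -> tuple[int, int]:
    """Return (a*c, b*c) using a single (N+S)-by-N multiplication."""
    assert 0 <= a < 2**N and 0 <= b < 2**N and 0 <= c < 2**N
    packed = (a << S) | b          # place a above b with guard bits
    product = packed * c           # one wide multiply does both products
    return product >> S, product & (2**S - 1)

# Both 8-bit results are recovered exactly from the one multiplication.
assert packed_mul(200, 37, 151) == (200 * 151, 37 * 151)
```

With halved guard spacing the same idea packs four 4-bit products into one multiplier, which is the intuition behind the 2×/4× packing ratios quoted above.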