Laius: An 8-Bit Fixed-Point CNN Hardware Inference Engine

The Convolutional Neural Network (CNN) is one of the most effective neural network models for many classification tasks, such as voice recognition, computer vision, and biological information processing. Unfortunately, CNN computation is both memory-intensive and compute-intensive, which poses a serious challenge to the design of hardware accelerators. A large number of hardware accelerators for CNN inference have been designed in industry and academia. Most of these engines are based on 32-bit floating-point matrix multiplication, where the data precision is over-provisioned for the inference job and the hardware cost is too high. In this paper, an 8-bit fixed-point LeNet inference engine (Laius) is designed and implemented on an FPGA. To reduce FPGA resource consumption, we propose a methodology for finding the optimal bit length for weights and biases in LeNet, which results in using 8-bit fixed point for most of the computation and 16-bit fixed point for the rest. We propose a PE (Processing Element) design and apply pipelining and PE tiling to improve the performance of the inference engine. Through theoretical analysis, we conclude that DSP blocks are the most critical FPGA resource and must be allocated carefully during the design process. We implement the inference engine on a Xilinx 485T FPGA. Experimental results show that the LeNet inference engine achieves 44.9 Gops of throughput with 8-bit fixed-point operations after pipelining. Moreover, with only 1% loss of accuracy, the 8-bit fixed-point engine reduces latency by 31.43%, LUT consumption by 87.01%, BRAM consumption by 66.50%, DSP consumption by 65.11%, and power by 47.95% compared to a 32-bit fixed-point inference engine with the same structure.
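The heart of such a methodology is deciding, per layer, how to split a fixed bit budget between integer and fractional bits. The Python sketch below illustrates one plausible version of that search, picking the fractional bit count that minimizes quantization error over a layer's weights; the helper names and the mean-squared-error criterion are illustrative assumptions, not the paper's actual procedure.

```python
import numpy as np

def quantize_fixed_point(x, total_bits=8, frac_bits=4):
    """Quantize a float array to signed fixed point with the given
    total width and fractional bit count, then dequantize so the
    result can be compared against the original values."""
    scale = 2 ** frac_bits
    qmin = -(2 ** (total_bits - 1))        # most negative code
    qmax = 2 ** (total_bits - 1) - 1       # most positive code
    q = np.clip(np.round(x * scale), qmin, qmax)
    return q / scale

def best_frac_bits(weights, total_bits=8):
    """Sweep every integer/fraction split of the bit budget and
    return the fractional bit count with the lowest MSE."""
    errors = []
    for f in range(total_bits):
        w_q = quantize_fixed_point(weights, total_bits, f)
        errors.append(np.mean((weights - w_q) ** 2))
    return int(np.argmin(errors))

# Example: per-layer search over a random weight tensor
w = np.random.randn(16, 16).astype(np.float32) * 0.5
print("best fractional bits:", best_frac_bits(w, total_bits=8))
```

In practice one would run this search layer by layer (weights and biases separately) and fall back to a wider 16-bit format only for the values whose dynamic range an 8-bit split cannot cover, matching the mixed 8/16-bit scheme described above.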
