Squeezing the Last MHz for CNN Acceleration on FPGAs

Neural networks, especially convolutional neural networks (CNNs), have become prevalent, and numerous CNN accelerators have been developed to achieve higher performance. Since clock frequency determines the operation speed and directly influences accelerator performance, we propose applying overclocking, a circuit optimization approach that enables higher clock frequencies, to general CNN accelerators. This technique brings significant performance improvements, but it also introduces moderate timing errors, which produce incorrect computing results and lower prediction accuracy. By exploiting the inherent fault tolerance of neural networks, we opt to learn the computing errors together with the application data through additional on-accelerator training. The resulting models can thus be resilient to the errors and do not necessarily suffer considerable prediction accuracy loss. In addition, we take the worst case of overclocking into consideration with a series of approaches ranging from fault detection to fault recovery in case of a hardware crash. Finally, we demonstrate overclocking on a CNN accelerator implemented on a Xilinx KCU1500 board with comprehensive experiments. The experiments show that overclocking, combined with on-accelerator neural network training, improves both performance and energy efficiency with only a small loss in prediction accuracy.
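The error-learning idea can be illustrated with a minimal sketch, assuming a software fault-injection model stands in for the hardware timing errors that overclocking would cause; the toy network, the inject_timing_errors function, and the error_rate parameter are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of error-aware training: a software fault-injection model
# stands in for overclocking-induced timing errors. All names and values
# here are illustrative, not the accelerator's actual design.
import numpy as np

rng = np.random.default_rng(0)

def inject_timing_errors(x, error_rate=0.01, magnitude=0.5):
    """Corrupt a random subset of activations to mimic timing violations."""
    mask = rng.random(x.shape) < error_rate
    noise = rng.uniform(-magnitude, magnitude, size=x.shape)
    return np.where(mask, x + noise, x)

def forward(x, W1, W2, faulty=False):
    h = x @ W1
    if faulty:                      # "overclocked" layer: results may be wrong
        h = inject_timing_errors(h)
    h = np.maximum(h, 0.0)          # ReLU
    return h, h @ W2

# Toy binary classification data
X = rng.normal(size=(256, 16))
y = (X[:, 0] + X[:, 1] > 0).astype(float).reshape(-1, 1)

W1 = rng.normal(scale=0.1, size=(16, 32))
W2 = rng.normal(scale=0.1, size=(32, 1))
lr = 0.05

# "On-accelerator" retraining: the faulty forward pass is part of training,
# so the weights adapt to the injected errors.
for epoch in range(200):
    h, logits = forward(X, W1, W2, faulty=True)
    p = 1.0 / (1.0 + np.exp(-logits))       # sigmoid
    grad_logits = (p - y) / len(X)          # gradient of BCE loss w.r.t. logits
    grad_W2 = h.T @ grad_logits
    grad_h = grad_logits @ W2.T
    grad_h[h <= 0] = 0.0                    # ReLU gradient
    grad_W1 = X.T @ grad_h
    W1 -= lr * grad_W1
    W2 -= lr * grad_W2

# Evaluate under the same faulty forward pass used at inference time
_, logits = forward(X, W1, W2, faulty=True)
acc = np.mean((logits > 0) == (y > 0.5))
print(f"accuracy under injected timing errors: {acc:.3f}")
```

Because the injected errors are present during both training and evaluation, the weights adapt to them, which mirrors the intuition behind retraining the model on the overclocked accelerator itself rather than on error-free hardware.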
