Safe Overclocking for CNN Accelerators Through Algorithm-Level Error Detection

In this article, we propose a technique for improving the efficiency of convolutional neural network hardware accelerators through timing speculation (overclocking) combined with fault tolerance. We augment the accelerator with a lightweight error detection mechanism that protects against timing errors in convolution layers, enabling aggressive timing speculation. The error detection mechanism operates at the algorithm level, exploiting algebraic properties of the computation, which allows the full implementation to be realized using high-level synthesis tools. Our prototype on a Xilinx ZC706 board demonstrates up to 60% higher throughput with negligible area overhead across a range of word-length implementations.
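The algebraic idea behind this style of algorithm-level (ABFT-like) error detection can be illustrated with a simple invariant of full convolution: the sum of the outputs must equal the product of the sums of the inputs and the kernel weights. The sketch below is illustrative only, not the paper's exact checksum scheme; the function names and the fault-injection flag are hypothetical.

```python
def conv_full(x, k):
    """Plain full 1-D convolution over integer sequences."""
    y = [0] * (len(x) + len(k) - 1)
    for i, xi in enumerate(x):
        for j, kj in enumerate(k):
            y[i + j] += xi * kj
    return y

def checked_conv(x, k, inject_error=False):
    """Compute the convolution and verify the checksum invariant
    sum(y) == sum(x) * sum(k), which holds exactly for integer
    full convolution. A timing error that corrupts any output
    element breaks the invariant and is flagged."""
    y = conv_full(x, k)
    if inject_error:
        y[0] += 1  # simulate a timing-induced bit flip
    ok = sum(y) == sum(x) * sum(k)
    return y, ok

# Fault-free run passes the check; an injected error is caught.
_, ok_clean = checked_conv([1, 2, 3, 4], [1, 2, 1])
_, ok_faulty = checked_conv([1, 2, 3, 4], [1, 2, 1], inject_error=True)
```

Because the checksum is a single extra accumulation per layer rather than duplicated hardware, a check of this kind adds very little area, which is what makes it attractive as a safety net for overclocked datapaths.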
