Safe Overclocking for CNN Accelerators Through Algorithm-Level Error Detection

In this article, we propose a technique for improving the efficiency of convolutional neural network hardware accelerators through timing speculation (overclocking) combined with fault tolerance. We augment the accelerator with a lightweight error detection mechanism that protects against timing errors in convolution layers, enabling aggressive timing speculation. The error detection mechanism operates at the algorithm level, exploiting algebraic properties of the computation, which allows the full implementation to be realized using high-level synthesis tools. Our prototype on a Xilinx ZC706 board demonstrates up to 60% higher throughput with negligible area overhead across a range of word-length implementations.
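The algebraic idea behind this style of algorithm-level (ABFT-like) error detection can be illustrated with a simple invariant of full convolution: the sum of the outputs must equal the product of the sums of the inputs and the kernel weights. The sketch below is illustrative only, not the paper's exact checksum scheme; the function names and the fault-injection flag are hypothetical.

```python
def conv_full(x, k):
    """Plain full 1-D convolution over integer sequences."""
    y = [0] * (len(x) + len(k) - 1)
    for i, xi in enumerate(x):
        for j, kj in enumerate(k):
            y[i + j] += xi * kj
    return y

def checked_conv(x, k, inject_error=False):
    """Compute the convolution and verify the checksum invariant
    sum(y) == sum(x) * sum(k), which holds exactly for integer
    full convolution. A timing error that corrupts any output
    element breaks the invariant and is flagged."""
    y = conv_full(x, k)
    if inject_error:
        y[0] += 1  # simulate a timing-induced bit flip
    ok = sum(y) == sum(x) * sum(k)
    return y, ok

# Fault-free run passes the check; an injected error is caught.
_, ok_clean = checked_conv([1, 2, 3, 4], [1, 2, 1])
_, ok_faulty = checked_conv([1, 2, 3, 4], [1, 2, 1], inject_error=True)
```

Because the checksum is a single extra accumulation per layer rather than duplicated hardware, a check of this kind adds very little area, which is what makes it attractive as a safety net for overclocked datapaths.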
