FT-CNN: Algorithm-Based Fault Tolerance for Convolutional Neural Networks

Convolutional neural networks (CNNs) are becoming increasingly important for solving challenging and critical problems in many fields. CNN inference applications have been deployed in safety-critical systems, which may suffer from soft errors caused by high-energy particles, high temperature, or abnormal voltage. Ensuring the stability of the CNN inference process against soft errors is therefore of critical importance. Traditional fault tolerance methods are not suitable for CNN inference: error-correcting code cannot protect computational components, instruction duplication techniques incur high overhead, and existing algorithm-based fault tolerance (ABFT) techniques cannot protect all convolution implementations. In this paper, we focus on protecting the CNN inference process against soft errors as efficiently as possible, with the following three contributions. (1) We propose several systematic ABFT schemes based on checksum techniques and thoroughly analyze their fault protection ability and runtime. Unlike traditional ABFT based on matrix-matrix multiplication, our schemes support any convolution implementation. (2) We design a novel workflow integrating all the proposed schemes to obtain high detection/correction ability with limited total runtime overhead. (3) We evaluate our approach on ImageNet with well-known CNN models including AlexNet, VGG-19, ResNet-18, and YOLOv2. Experimental results demonstrate that our implementation can handle soft errors with very limited runtime overhead (4%~8% in both error-free and error-injected situations).
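To make the checksum idea concrete, the sketch below illustrates one form of filter-level checksum ABFT for convolution: by linearity, convolving the input with the sum of the filters must equal the sum of the per-filter outputs, so a mismatch flags a soft error in some output channel. This is a minimal illustrative example, not the paper's exact schemes; the function names, the direct-loop convolution, and the tolerance are assumptions made for the sketch.

```python
import numpy as np

def conv2d(x, k):
    """Valid-mode 2D cross-correlation via direct loops (illustrative, not fast)."""
    H, W = x.shape
    kh, kw = k.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

def filter_checksum_check(x, kernels, outputs, tol=1e-6):
    """Checksum invariant: conv(x, sum(kernels)) == sum(conv(x, k) for k in kernels).

    A violation beyond floating-point tolerance indicates a soft error
    somewhere in the computed outputs.
    """
    expected = conv2d(x, sum(kernels))   # one extra convolution as the checksum
    actual = sum(outputs)
    return float(np.max(np.abs(expected - actual))) < tol

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))
kernels = [rng.standard_normal((3, 3)) for _ in range(4)]
outputs = [conv2d(x, k) for k in kernels]

print(filter_checksum_check(x, kernels, outputs))   # error-free: checksum passes
outputs[2][1, 1] += 5.0                             # inject a soft error (bit-flip analog)
print(filter_checksum_check(x, kernels, outputs))   # mismatch detected
```

Note that the check costs one additional convolution regardless of the number of filters, which is the source of the low overhead such schemes aim for; locating and correcting the faulty element requires additional checksums along other dimensions.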
