HarDNN: Fine-Grained Vulnerability Evaluation and Protection for Convolutional Neural Networks

As CNNs are increasingly employed in high-performance computing and safety-critical applications, ensuring they are resilient to transient hardware errors is important. Full duplication provides high reliability, but its overheads are prohibitively high for resource-constrained systems. Fine-grained resilience evaluation and protection can provide a low-cost solution, but traditional evaluation methods can be too slow: they rely on error injections and essentially discard information from experiments that do not corrupt outcomes. In this work, we replace the binary view of errors with a new continuous, domain-specific metric based on cross-entropy loss to quantify corruptions, allowing error analysis to converge faster. This enables us to scale up to large networks. We study the effectiveness of this method using different error models and also compare it with heuristics that aim to predict vulnerability quickly. We show that selective, fine-grained protection of the most vulnerable components of a CNN provides a significantly lower-overhead solution than full duplication. Lastly, we present a framework called HarDNN that packages all these solutions for easy application.
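
The contrast between a binary view of errors and a continuous, loss-based one can be illustrated with a minimal sketch. This is not the HarDNN implementation: the function names, the use of the absolute loss difference, and the example logits are all assumptions made for illustration. It shows how an injected error that leaves the top-1 prediction unchanged, and so would be recorded as "no corruption" under a binary metric, can still register a nonzero change in cross-entropy loss.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, label):
    """Cross-entropy loss of the logits against the true class index."""
    return -math.log(softmax(logits)[label])

def mismatch(golden_logits, faulty_logits):
    """Binary view: 1 if the top-1 class changed, else 0."""
    def argmax(xs):
        return max(range(len(xs)), key=xs.__getitem__)
    return int(argmax(golden_logits) != argmax(faulty_logits))

def delta_loss(golden_logits, faulty_logits, label):
    """Continuous view (assumed form): absolute change in cross-entropy loss."""
    return abs(cross_entropy(faulty_logits, label)
               - cross_entropy(golden_logits, label))

# Hypothetical outputs of an error-free run and a run with one injected error.
golden = [4.0, 1.0, 0.5]
faulty = [2.0, 1.0, 0.5]  # class 0 is still the top-1 prediction
label = 0

binary_outcome = mismatch(golden, faulty)               # no misclassification
continuous_outcome = delta_loss(golden, faulty, label)  # still measurably worse
print(binary_outcome, round(continuous_outcome, 3))
```

Under these assumptions every injection contributes a graded value rather than a mostly zero indicator, which is the intuition behind the faster convergence claimed above.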
