Characterizing Deep Learning Neural Network Failures Between Algorithmic Inaccuracy and Transient Hardware Faults

Deep Neural Networks (DNNs) have been widely deployed in safety-critical applications such as autonomous vehicles, healthcare, and space systems. DNN models have long suffered from intrinsic algorithmic inaccuracies, and the increasing rate of transient hardware faults in computer systems now raises additional safety and reliability concerns for these applications. This paper investigates the impact of DNN misclassifications caused by transient hardware faults and by intrinsic algorithmic inaccuracy in safety-critical applications. We first extend TensorFI, a state-of-the-art fault injector for TensorFlow applications, to support scalable fault injection on modern DNN models. We then characterize the outcome classes of the models and analyze them using safety-related metrics. Finally, we conduct a large-scale fault injection experiment to measure failures according to these metrics and study their impact on safety. We observe that failures caused by transient hardware faults can have a much more significant impact on safety-critical applications (up to 4 times higher probability) than failures caused by DNN algorithmic inaccuracies, which argues for protecting DNNs from hardware faults in safety-critical applications.
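As a rough illustration of what a transient-fault injection experiment involves, the sketch below flips a single bit in the float32 encoding of one value and checks whether the predicted class changes. It is a minimal standalone example, not TensorFI's API: the helper names (`flip_bit`, `classify_outcome`), the single-bit-flip fault model, and the two outcome labels are assumptions made for illustration and may differ from the outcome classes used in the paper.

```python
import struct

def flip_bit(value: float, bit: int) -> float:
    """Return `value` with one bit of its IEEE-754 float32 encoding flipped.

    Hypothetical helper: models a transient fault as a single bit flip.
    """
    if not 0 <= bit < 32:
        raise ValueError("bit must be in [0, 31]")
    # Reinterpret the float32 bytes as an unsigned 32-bit integer.
    (raw,) = struct.unpack("<I", struct.pack("<f", value))
    # XOR with a one-hot mask flips exactly the chosen bit.
    (flipped,) = struct.unpack("<f", struct.pack("<I", raw ^ (1 << bit)))
    return flipped

def argmax(values):
    """Index of the largest element (first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])

def classify_outcome(golden_scores, faulty_scores):
    """Compare the fault-free and faulty predictions.

    The labels are assumed for illustration: "benign" when the fault
    does not change the predicted class, "sdc" (silent data corruption)
    when it does.
    """
    if argmax(golden_scores) == argmax(faulty_scores):
        return "benign"
    return "sdc"

# Made-up output scores of a three-class model; class 1 is predicted.
golden = [0.1, 2.0, 0.3]

# Inject a fault into the first score by flipping a high exponent bit.
faulty = list(golden)
faulty[0] = flip_bit(faulty[0], 30)

outcome = classify_outcome(golden, faulty)
```

Flipping a high exponent bit typically changes a value by many orders of magnitude, which is why a single such fault can change the predicted class, whereas flips in low mantissa bits are often masked.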
