Error Resilient Machine Learning for Safety-Critical Systems: Position Paper

Machine learning (ML) has increasingly been adopted in safety-critical systems such as autonomous vehicles (AVs) and industrial robotics. In these domains, reliability and safety are primary requirements, and hence it is critical to ensure the resilience of ML systems to faults and errors. At the same time, soft errors are becoming more frequent in commodity computer systems due to the effects of technology scaling and reduced supply voltages. Further, traditional solutions for masking hardware faults, such as Triple-Modular Redundancy (TMR), are prohibitively expensive in terms of their energy and performance overheads. Therefore, there is a compelling need to ensure the resilience of ML applications to soft errors on commodity hardware platforms.

We first experimentally assess the resilience of safety-critical ML applications to soft errors. We demonstrate through fault injection experiments that even a single bit flip due to a soft error can lead to misclassification in Deep Neural Network (DNN) applications deployed in AVs, leading to safety violations. However, not all errors in a DNN result in severe consequences such as safety violations, and hence it is sufficient to protect the DNN from the ones that do. Unfortunately, finding all possible errors that result in safety violations is a very compute-intensive task. We propose BinFI, a fault injection approach that exploits the unique properties of DNNs to efficiently identify the critical faults that are highly likely to result in safety violations. Finally, we propose Ranger, an approach to protect DNNs from critical faults with minimal performance overheads and no accuracy loss. We conclude by presenting some of our ongoing work and the future challenges in this area.
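To make the single-bit-flip claim concrete, the following is a minimal Python sketch and not the fault injection tooling used in this work. It assumes values are stored as IEEE-754 float32 and shows how flipping one high-order exponent bit in one output score can change the predicted class, whereas flipping a low-order mantissa bit does not. The helper names `flip_bit` and `argmax`, and the example scores, are hypothetical.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return `value`, stored as an IEEE-754 float32, with one bit flipped.

    Bit 0 is the least significant mantissa bit; bit 31 is the sign bit.
    """
    if not 0 <= bit < 32:
        raise ValueError("bit must be in [0, 31]")
    # Reinterpret the float32 bytes as an unsigned 32-bit integer.
    (raw,) = struct.unpack("<I", struct.pack("<f", value))
    # XOR with a one-hot mask flips exactly one bit.
    (flipped,) = struct.unpack("<f", struct.pack("<I", raw ^ (1 << bit)))
    return flipped


def argmax(scores):
    """Index of the largest score, i.e. the predicted class."""
    return max(range(len(scores)), key=lambda i: scores[i])


# Hypothetical output scores of a three-class classifier (assumed values).
scores = [2.0, 0.5, 1.0]

# Fault in the most significant exponent bit of the second score.
critical = list(scores)
critical[1] = flip_bit(critical[1], 30)

# Fault in the least significant mantissa bit of the same score.
benign = list(scores)
benign[1] = flip_bit(benign[1], 0)

print("fault-free prediction:", argmax(scores))
print("after exponent-bit flip:", argmax(critical), critical[1])
print("after mantissa-bit flip:", argmax(benign), benign[1])
```

In this sketch the exponent-bit flip turns 0.5 into a value on the order of 10^38 and changes the predicted class, while the mantissa-bit flip leaves the prediction unchanged, which is consistent with the observation that only some errors have severe consequences.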
