Fault Injectors for TensorFlow: Evaluation of the Impact of Random Hardware Faults on Deep CNNs

Today, Deep Learning (DL) enhances almost every industrial sector, including safety-critical areas. The next generation of safety standards will define appropriate verification techniques for DL-based applications and propose adequate fault tolerance mechanisms. DL-based applications, like any other software, are susceptible to common random hardware faults such as bit flips, which occur in RAM and CPU registers. Such faults can lead to silent data corruption. Therefore, it is crucial to develop methods and tools that help to evaluate how DL components operate in the presence of such faults. In this paper, we introduce two new Fault Injection (FI) frameworks, InjectTF and InjectTF2, for TensorFlow 1 and TensorFlow 2, respectively. Both frameworks are available on GitHub and allow the configurable injection of random faults into Neural Networks (NNs). To demonstrate the feasibility of the frameworks, we also present the results of FI experiments conducted on four VGG-based Convolutional NNs using two image sets. The results demonstrate how random bit flips in the outputs of particular mathematical operations and layers of NNs affect the classification accuracy. These results help to identify the most critical operations and layers, compare the reliability characteristics of functionally similar NNs, and introduce selective fault tolerance mechanisms.
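The fault model described above, a single random bit flip in the IEEE-754 encoding of an operation's output value, can be sketched in a few lines. The snippet below is an illustrative reconstruction of that mechanism only; it is not the actual InjectTF or InjectTF2 API, and the function name `flip_random_bit` is an assumption introduced here for illustration.

```python
import random
import struct

def flip_random_bit(value, bit=None):
    """Flip one bit in the single-precision (float32) encoding of `value`.

    Mimics the random bit-flip fault model described in the paper;
    this is NOT the InjectTF/InjectTF2 API, just an illustrative sketch.
    """
    if bit is None:
        bit = random.randrange(32)  # pick one of the 32 bit positions
    # Reinterpret the float32 as a 32-bit unsigned integer, XOR one bit,
    # then reinterpret the result as a float32 again.
    (as_int,) = struct.unpack("<I", struct.pack("<f", value))
    (faulty,) = struct.unpack("<f", struct.pack("<I", as_int ^ (1 << bit)))
    return faulty
```

Applying such a flip to the output tensor of a selected layer (e.g., a convolution or activation) and re-running inference is what allows the experiments to measure how the corruption propagates to the final classification.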
