Understanding and Mitigating Hardware Failures in Deep Learning Training Systems

Deep neural network (DNN) training workloads are increasingly susceptible to hardware failures in datacenters. For example, Google experienced "mysterious, difficult to identify problems" in its TPU training systems due to hardware failures [7]. Although these particular problems were eventually corrected through significant effort, they underscore the urgency of addressing the growing challenges that hardware failures pose to many DNN training workloads. In this paper, we present the first in-depth resilience study targeting DNN training workloads and hardware failures that occur in the logic portion of deep learning (DL) accelerator systems. We developed a fault injection framework that accurately simulates the effects of various hardware failures based on the design of an industrial DL accelerator, and conducted over 2.9M experiments (over 490K node-hours) using representative workloads. Based on these experiments, we present (1) a comprehensive characterization of hardware failure effects, (2) a fundamental understanding of how hardware failures propagate in training devices and interact with training workloads, and (3) the necessary conditions that must be satisfied for these failures to cause unexpected training outcomes. The insights obtained from our study enabled us to develop ultra-lightweight software techniques to mitigate hardware failures. Our techniques require changing only 24–32 lines of code and introduce 0.003%–0.025% performance overhead for various representative workloads. Our observations and techniques are broadly applicable to mitigating various hardware failures in DL training accelerator systems.
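The fault injection framework and the accelerator design it models are not public, so the following is only an illustrative sketch of two ingredients the abstract alludes to: a software-level model of a single transient bit flip in a float32 datapath value, and a range-restriction check of the kind used by lightweight mitigations such as clipped activations [25] and Ranger [20]. The function names and the clipping bound here are hypothetical, not taken from the paper.

```python
import struct

def flip_bit(value: float, bit: int) -> float:
    """Flip one bit of an IEEE-754 float32 value -- a simple software
    model of a transient hardware fault in a datapath register."""
    (raw,) = struct.unpack("<I", struct.pack("<f", value))
    (out,) = struct.unpack("<f", struct.pack("<I", raw ^ (1 << bit)))
    return out

def clamp_activation(x: float, bound: float = 1e4) -> float:
    """Range restriction: suppress the anomalously large values that a
    fault may produce, before they propagate through training."""
    return max(-bound, min(bound, x))

# Flipping a high exponent bit turns 0.375 into a huge value (~1.3e38)...
faulty = flip_bit(0.375, 30)
# ...which the range-restriction check suppresses.
safe = clamp_activation(faulty)
```

Flipping the same bit twice restores the original value, which is what makes such single-bit-flip models convenient for large injection campaigns.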

[1] Harish Dattatraya Dixit, et al. Detecting silent data corruptions in the wild, 2022, arXiv.

[2] Yunxiang Hu, et al. A Survey on Convolutional Neural Network Accelerators: GPU, FPGA and ASIC, 2022, 2022 14th International Conference on Computer Research and Development (ICCRD).

[3] Christopher W. Fletcher, et al. Optimizing Selective Protection for CNN Resilience, 2021, 2021 IEEE 32nd International Symposium on Software Reliability Engineering (ISSRE).

[4] Jeremy Kepner, et al. AI Accelerator Survey and Trends, 2021, 2021 IEEE High Performance Extreme Computing Conference (HPEC).

[5] Izzeldin I. Mohd, et al. Analyzing the Resilience of Convolutional Neural Networks Implemented on GPUs, 2021, International Journal of Electrical and Computer Engineering Systems.

[6] David E. Culler, et al. Cores that don't count, 2021, HotOS.

[7] Dimitris Gizopoulos, et al. Demystifying the System Vulnerability Stack: Transient Fault Effects Across the Layers, 2021, 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA).

[8] Peter C. Ma, et al. Ten Lessons From Three Generations Shaped Google's TPUv4i: Industrial Product, 2021, 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA).

[9] Sriram Sankar, et al. Silent Data Corruptions at Scale, 2021, arXiv.

[10] K. Simonyan, et al. High-Performance Large-Scale Image Recognition Without Normalization, 2021, ICML.

[11] Li Chen, et al. Soft errors in DNN accelerators: A comprehensive review, 2020.

[12] Alex Orailoglu, et al. Just Say Zero: Containing Critical Bit-Error Propagation in Deep Neural Networks With Anomalous Feature Suppression, 2020, 2020 IEEE/ACM International Conference On Computer Aided Design (ICCAD).

[13] Alex Orailoglu, et al. Boosting Bit-Error Resilience of DNN Accelerators Through Median Feature Selection, 2020, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems.

[14] Yi He, et al. FIdelity: Efficient Resilience Analysis Framework for Deep Learning Accelerators, 2020, 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[15] Philipp Hennig, et al. Descending through a Crowded Valley - Benchmarking Deep Learning Optimizers, 2020, ICML.

[16] Zhiwei Steven Wu, et al. Understanding Gradient Clipping in Private SGD: A Geometric Perspective, 2020, NeurIPS.

[17] Stephen W. Keckler, et al. Making Convolutions Resilient Via Algorithm-Based Error Detection Techniques, 2020, IEEE Transactions on Dependable and Secure Computing.

[18] Ankit Singh Rawat, et al. Can gradient clipping mitigate label noise?, 2020, ICLR.

[19] K. Pattabiraman, et al. A Low-cost Fault Corrector for Deep Neural Networks through Range Restriction, 2020, 2021 51st Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN).

[20] Zitao Chen, et al. Ranger: Boosting Error Resilience of Deep Neural Networks through Range Restriction, 2020, arXiv.

[21] Franck Cappello, et al. FT-CNN: Algorithm-Based Fault Tolerance for Convolutional Neural Networks, 2020, IEEE Transactions on Parallel and Distributed Systems.

[22] Bor-Yiing Su, et al. Deep Learning Training in Facebook Data Centers: Design of Scale-up and Scale-out Systems, 2020, arXiv.

[23] Sparsh Mittal, et al. A survey on modeling and improving reliability of DNN algorithms and accelerators, 2020, Journal of Systems Architecture.

[24] Yiran Chen, et al. A Survey of Accelerator Architectures for Deep Neural Networks, 2020.

[25] Muhammad Abdullah Hanif, et al. FT-ClipAct: Resilience Analysis of Deep Neural Networks and Improving their Fault Tolerance using Clipped Activation, 2019, 2020 Design, Automation & Test in Europe Conference & Exhibition (DATE).

[26] Yanjing Li, et al. Time-Slicing Soft Error Resilience in Microprocessors for Reliable and Energy-Efficient Execution, 2019, 2019 IEEE International Test Conference (ITC).

[27] Lei Huang, et al. Quantifying the Impact of Memory Errors in Deep Learning, 2019, 2019 IEEE International Conference on Cluster Computing (CLUSTER).

[28] Matthew R. Walter, et al. Multigrid Neural Memory, 2019, ICML.

[29] Dimitris Gizopoulos, et al. Demystifying Soft Error Assessment Strategies on ARM CPUs: Microarchitectural Fault Injection vs. Neutron Beam Experiments, 2019, 2019 49th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN).

[30] Suvrit Sra, et al. Why Gradient Clipping Accelerates Training: A Theoretical Justification for Adaptivity, 2019, ICLR.

[31] Quoc V. Le, et al. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks, 2019, ICML.

[32] Gerd Ascheid, et al. An Efficient Bit-Flip Resilience Optimization Method for Deep Neural Networks, 2019, 2019 Design, Automation & Test in Europe Conference & Exhibition (DATE).

[33] Rajiv V. Joshi, et al. Resilient Low Voltage Accelerators for High Energy Efficiency, 2019, 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA).

[34] Stephen W. Keckler, et al. Optimizing Software-Directed Instruction Replication for GPU Error Detection, 2018, SC18: International Conference for High Performance Computing, Networking, Storage and Analysis.

[35] Nathan DeBardeleben, et al. TensorFI: A Configurable Fault Injector for TensorFlow Applications, 2018, 2018 IEEE International Symposium on Software Reliability Engineering Workshops (ISSREW).

[36] Gu-Yeon Wei, et al. Ares: A framework for quantifying the resilience of deep neural networks, 2018, 2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC).

[37] Joseph Redmon, et al. YOLOv3: An Incremental Improvement, 2018, arXiv.

[38] Gerd Ascheid, et al. Accurate neuron resilience prediction for a flexible reliability management in neural network accelerators, 2018, 2018 Design, Automation & Test in Europe Conference & Exhibition (DATE).

[39] David M. Brooks, et al. Applied Machine Learning at Facebook: A Datacenter Infrastructure Perspective, 2018, 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA).

[40] Guanpeng Li, et al. Understanding Error Propagation in Deep Learning Neural Network (DNN) Accelerators and Applications, 2017, SC17: International Conference for High Performance Computing, Networking, Storage and Analysis.

[41] Subhasish Mitra, et al. E-QED: Electrical Bug Localization During Post-silicon Validation Enabled by Quick Error Detection and Formal Methods, 2017, CAV.

[42] Dimitris Gizopoulos, et al. MeRLiN: Exploiting dynamic instruction behavior for fast and accurate microarchitecture level reliability assessment, 2017, 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA).

[43] Lukasz Kaiser, et al. Attention is All you Need, 2017, NIPS.

[44] Gu-Yeon Wei, et al. 14.3 A 28nm SoC with a 1.2GHz 568nJ/prediction sparse deep-neural-network engine with >0.1 timing error rate tolerance for IoT applications, 2017, 2017 IEEE International Solid-State Circuits Conference (ISSCC).

[45] Sarita V. Adve, et al. Approxilyzer: Towards a systematic framework for instruction-level approximate computing and its application to hardware resiliency, 2016, 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[46] Kilian Q. Weinberger, et al. Densely Connected Convolutional Networks, 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[47] Gu-Yeon Wei, et al. Minerva: Enabling Low-Power, Highly-Accurate Deep Neural Network Accelerators, 2016, 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA).

[48] Eric Cheng, et al. CLEAR: Cross-layer exploration for architecting resilience: Combining hardware and software techniques to tolerate soft errors in processor cores, 2016, 2016 53rd ACM/EDAC/IEEE Design Automation Conference (DAC).

[49] Jian Sun, et al. Deep Residual Learning for Image Recognition, 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[50] Jacob A. Abraham, et al. Efficient soft error vulnerability estimation of complex designs, 2015, 2015 Design, Automation & Test in Europe Conference & Exhibition (DATE).

[51] Jian Sun, et al. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification, 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[52] Dumitru Erhan, et al. Going deeper with convolutions, 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[53] Farzan Fallah, et al. Effective Post-Silicon Validation of System-on-Chips Using Quick Error Detection, 2014, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems.

[54] Philipp Koehn, et al. Findings of the 2014 Workshop on Statistical Machine Translation, 2014, WMT@ACL.

[55] Karine Heydemann, et al. Electromagnetic Fault Injection: Towards a Fault Model on a 32-bit Microcontroller, 2013, 2013 Workshop on Fault Diagnosis and Tolerance in Cryptography.

[56] Muhammad Shafique, et al. Exploiting program-level masking and error propagation for constrained reliability optimization, 2013, 2013 50th ACM/EDAC/IEEE Design Automation Conference (DAC).

[57] Jacob A. Abraham, et al. Quantitative evaluation of soft error injection techniques for robust system design, 2013, 2013 50th ACM/EDAC/IEEE Design Automation Conference (DAC).

[58] Sarita V. Adve, et al. Relyzer: Application Resiliency Analyzer for Transient Faults, 2013, IEEE Micro.

[59] Sarita V. Adve, et al. Low-cost program-level detectors for reducing silent data corruptions, 2012, IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2012).

[60] Sarita V. Adve, et al. Relyzer: exploiting application-level fault equivalence to analyze application resiliency to transient faults, 2012, ASPLOS XVII.

[61] Jason Cong, et al. Assuring application-level correctness against soft errors, 2011, 2011 IEEE/ACM International Conference on Computer-Aided Design (ICCAD).

[62] Alexander J. Smola, et al. Parallelized Stochastic Gradient Descent, 2010, NIPS.

[63] David Lin, et al. QED: Quick Error Detection tests for effective post-silicon validation, 2010, 2010 IEEE International Test Conference.

[64] Yoshua Bengio, et al. Understanding the difficulty of training deep feedforward neural networks, 2010, AISTATS.

[65] Amin Ansari, et al. Shoestring: probabilistic soft error reliability on the cheap, 2010, ASPLOS XV.

[66] Sarita V. Adve, et al. Using likely program invariants to detect hardware errors, 2008, 2008 IEEE International Conference on Dependable Systems and Networks With FTCS and DCC (DSN).

[67] Sarita V. Adve, et al. Understanding the propagation of hard errors to software and implications for resilient system design, 2008, ASPLOS.

[68] David Blaauw, et al. Razor II: In Situ Error Detection and Correction for PVT and SER Tolerance, 2008, 2008 IEEE International Solid-State Circuits Conference - Digest of Technical Papers.

[69] Albert Meixner, et al. Argus: Low-Cost, Comprehensive Error Detection in Simple Cores, 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[70] Sanjay J. Patel, et al. ReStore: symptom based soft error detection in microprocessors, 2005, 2005 International Conference on Dependable Systems and Networks (DSN'05).

[71] David I. August, et al. SWIFT: software implemented fault tolerance, 2005, International Symposium on Code Generation and Optimization.

[72] Edward J. McCluskey, et al. Error detection by duplicated instructions in super-scalar processors, 2002, IEEE Transactions on Reliability.

[73] Edward J. McCluskey, et al. Control-flow checking by software signatures, 2002, IEEE Transactions on Reliability.

[74] Michael N. Lovellette, et al. Strategies for fault-tolerant, space-based computing: Lessons learned from the ARGOS testbed, 2002, Proceedings, IEEE Aerospace Conference.

[75] Izzeldin Ibrahim Mohamed, et al. A Selective Mitigation Technique of Soft Errors for DNN Models Used in Healthcare Applications: DenseNet201 Case Study, 2021, IEEE Access.

[76] Abbreviazioni Periodici Giuridici N. D. I., 2013.

[77] Alex Krizhevsky, et al. Learning Multiple Layers of Features from Tiny Images, 2009.