Understanding and Mitigating Hardware Failures in Deep Learning Training Systems
[1] Harish Dattatraya Dixit,et al. Detecting silent data corruptions in the wild , 2022, ArXiv.
[2] Yunxiang Hu,et al. A Survey on Convolutional Neural Network Accelerators: GPU, FPGA and ASIC , 2022, 2022 14th International Conference on Computer Research and Development (ICCRD).
[3] Christopher W. Fletcher,et al. Optimizing Selective Protection for CNN Resilience , 2021, 2021 IEEE 32nd International Symposium on Software Reliability Engineering (ISSRE).
[4] Jeremy Kepner,et al. AI Accelerator Survey and Trends , 2021, 2021 IEEE High Performance Extreme Computing Conference (HPEC).
[5] Izzeldin I. Mohd,et al. Analyzing the Resilience of Convolutional Neural Networks Implemented on GPUs , 2021, International journal of electrical and computer engineering systems.
[6] David E. Culler,et al. Cores that don't count , 2021, HotOS.
[7] Dimitris Gizopoulos,et al. Demystifying the System Vulnerability Stack: Transient Fault Effects Across the Layers , 2021, 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA).
[8] Peter C. Ma,et al. Ten Lessons From Three Generations Shaped Google’s TPUv4i : Industrial Product , 2021, 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA).
[9] Sriram Sankar,et al. Silent Data Corruptions at Scale , 2021, ArXiv.
[10] K. Simonyan,et al. High-Performance Large-Scale Image Recognition Without Normalization , 2021, ICML.
[11] Li Chen,et al. Soft errors in DNN accelerators: A comprehensive review , 2020 .
[12] Alex Orailoglu,et al. Just Say Zero: Containing Critical Bit-Error Propagation in Deep Neural Networks With Anomalous Feature Suppression , 2020, 2020 IEEE/ACM International Conference On Computer Aided Design (ICCAD).
[13] Alex Orailoglu,et al. Boosting Bit-Error Resilience of DNN Accelerators Through Median Feature Selection , 2020, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems.
[14] Yi He,et al. FIdelity: Efficient Resilience Analysis Framework for Deep Learning Accelerators , 2020, 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[15] Philipp Hennig,et al. Descending through a Crowded Valley - Benchmarking Deep Learning Optimizers , 2020, ICML.
[16] Zhiwei Steven Wu,et al. Understanding Gradient Clipping in Private SGD: A Geometric Perspective , 2020, NeurIPS.
[17] Stephen W. Keckler,et al. Making Convolutions Resilient Via Algorithm-Based Error Detection Techniques , 2020, IEEE Transactions on Dependable and Secure Computing.
[18] Ankit Singh Rawat,et al. Can gradient clipping mitigate label noise? , 2020, ICLR.
[19] K. Pattabiraman,et al. A Low-cost Fault Corrector for Deep Neural Networks through Range Restriction , 2020, 2021 51st Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN).
[20] Zitao Chen,et al. Ranger: Boosting Error Resilience of Deep Neural Networks through Range Restriction , 2020, ArXiv.
[21] Franck Cappello,et al. FT-CNN: Algorithm-Based Fault Tolerance for Convolutional Neural Networks , 2020, IEEE Transactions on Parallel and Distributed Systems.
[22] Bor-Yiing Su,et al. Deep Learning Training in Facebook Data Centers: Design of Scale-up and Scale-out Systems , 2020, ArXiv.
[23] Sparsh Mittal,et al. A survey on modeling and improving reliability of DNN algorithms and accelerators , 2020, J. Syst. Archit..
[24] Yiran Chen,et al. A Survey of Accelerator Architectures for Deep Neural Networks , 2020 .
[25] Muhammad Abdullah Hanif,et al. FT-ClipAct: Resilience Analysis of Deep Neural Networks and Improving their Fault Tolerance using Clipped Activation , 2019, 2020 Design, Automation & Test in Europe Conference & Exhibition (DATE).
[26] Yanjing Li,et al. Time-Slicing Soft Error Resilience in Microprocessors for Reliable and Energy-Efficient Execution , 2019, 2019 IEEE International Test Conference (ITC).
[27] Lei Huang,et al. Quantifying the Impact of Memory Errors in Deep Learning , 2019, 2019 IEEE International Conference on Cluster Computing (CLUSTER).
[28] Matthew R. Walter,et al. Multigrid Neural Memory , 2019, ICML.
[29] Dimitris Gizopoulos,et al. Demystifying Soft Error Assessment Strategies on ARM CPUs: Microarchitectural Fault Injection vs. Neutron Beam Experiments , 2019, 2019 49th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN).
[30] Suvrit Sra,et al. Why Gradient Clipping Accelerates Training: A Theoretical Justification for Adaptivity , 2019, ICLR.
[31] Quoc V. Le,et al. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks , 2019, ICML.
[32] Gerd Ascheid,et al. An Efficient Bit-Flip Resilience Optimization Method for Deep Neural Networks , 2019, 2019 Design, Automation & Test in Europe Conference & Exhibition (DATE).
[33] Rajiv V. Joshi,et al. Resilient Low Voltage Accelerators for High Energy Efficiency , 2019, 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA).
[34] Stephen W. Keckler,et al. Optimizing Software-Directed Instruction Replication for GPU Error Detection , 2018, SC18: International Conference for High Performance Computing, Networking, Storage and Analysis.
[35] Nathan DeBardeleben,et al. TensorFI: A Configurable Fault Injector for TensorFlow Applications , 2018, 2018 IEEE International Symposium on Software Reliability Engineering Workshops (ISSREW).
[36] Gu-Yeon Wei,et al. Ares: A framework for quantifying the resilience of deep neural networks , 2018, 2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC).
[37] Joseph Redmon,et al. YOLOv3: An Incremental Improvement , 2018, ArXiv.
[38] Gerd Ascheid,et al. Accurate neuron resilience prediction for a flexible reliability management in neural network accelerators , 2018, 2018 Design, Automation & Test in Europe Conference & Exhibition (DATE).
[39] David M. Brooks,et al. Applied Machine Learning at Facebook: A Datacenter Infrastructure Perspective , 2018, 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA).
[40] Guanpeng Li,et al. Understanding Error Propagation in Deep Learning Neural Network (DNN) Accelerators and Applications , 2017, SC17: International Conference for High Performance Computing, Networking, Storage and Analysis.
[41] Subhasish Mitra,et al. E-QED: Electrical Bug Localization During Post-silicon Validation Enabled by Quick Error Detection and Formal Methods , 2017, CAV.
[42] Dimitris Gizopoulos,et al. MeRLiN: Exploiting dynamic instruction behavior for fast and accurate microarchitecture level reliability assessment , 2017, 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA).
[43] Lukasz Kaiser,et al. Attention is All you Need , 2017, NIPS.
[44] Gu-Yeon Wei,et al. 14.3 A 28nm SoC with a 1.2GHz 568nJ/prediction sparse deep-neural-network engine with >0.1 timing error rate tolerance for IoT applications , 2017, 2017 IEEE International Solid-State Circuits Conference (ISSCC).
[45] Sarita V. Adve,et al. Approxilyzer: Towards a systematic framework for instruction-level approximate computing and its application to hardware resiliency , 2016, 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[46] Kilian Q. Weinberger,et al. Densely Connected Convolutional Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[47] Gu-Yeon Wei,et al. Minerva: Enabling Low-Power, Highly-Accurate Deep Neural Network Accelerators , 2016, 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA).
[48] Eric Cheng,et al. CLEAR: Cross-layer exploration for architecting resilience: Combining hardware and software techniques to tolerate soft errors in processor cores , 2016, 2016 53rd ACM/EDAC/IEEE Design Automation Conference (DAC).
[49] Jian Sun,et al. Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[50] Jacob A. Abraham,et al. Efficient soft error vulnerability estimation of complex designs , 2015, 2015 Design, Automation & Test in Europe Conference & Exhibition (DATE).
[51] Jian Sun,et al. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).
[52] Dumitru Erhan,et al. Going deeper with convolutions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[53] Farzan Fallah,et al. Effective Post-Silicon Validation of System-on-Chips Using Quick Error Detection , 2014, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems.
[54] Philipp Koehn,et al. Findings of the 2014 Workshop on Statistical Machine Translation , 2014, WMT@ACL.
[55] Karine Heydemann,et al. Electromagnetic Fault Injection: Towards a Fault Model on a 32-bit Microcontroller , 2013, 2013 Workshop on Fault Diagnosis and Tolerance in Cryptography.
[56] Muhammad Shafique,et al. Exploiting program-level masking and error propagation for constrained reliability optimization , 2013, 2013 50th ACM/EDAC/IEEE Design Automation Conference (DAC).
[57] Jacob A. Abraham,et al. Quantitative evaluation of soft error injection techniques for robust system design , 2013, 2013 50th ACM/EDAC/IEEE Design Automation Conference (DAC).
[58] Sarita V. Adve,et al. Relyzer: Application Resiliency Analyzer for Transient Faults , 2013, IEEE Micro.
[59] Sarita V. Adve,et al. Low-cost program-level detectors for reducing silent data corruptions , 2012, IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2012).
[60] Sarita V. Adve,et al. Relyzer: exploiting application-level fault equivalence to analyze application resiliency to transient faults , 2012, ASPLOS XVII.
[61] Jason Cong,et al. Assuring application-level correctness against soft errors , 2011, 2011 IEEE/ACM International Conference on Computer-Aided Design (ICCAD).
[62] Alexander J. Smola,et al. Parallelized Stochastic Gradient Descent , 2010, NIPS.
[63] David Lin,et al. QED: Quick Error Detection tests for effective post-silicon validation , 2010, 2010 IEEE International Test Conference.
[64] Yoshua Bengio,et al. Understanding the difficulty of training deep feedforward neural networks , 2010, AISTATS.
[65] Amin Ansari,et al. Shoestring: probabilistic soft error reliability on the cheap , 2010, ASPLOS XV.
[66] Sarita V. Adve,et al. Using likely program invariants to detect hardware errors , 2008, 2008 IEEE International Conference on Dependable Systems and Networks With FTCS and DCC (DSN).
[67] Sarita V. Adve,et al. Understanding the propagation of hard errors to software and implications for resilient system design , 2008, ASPLOS.
[68] David Blaauw,et al. Razor II: In Situ Error Detection and Correction for PVT and SER Tolerance , 2008, 2008 IEEE International Solid-State Circuits Conference - Digest of Technical Papers.
[69] Albert Meixner,et al. Argus: Low-Cost, Comprehensive Error Detection in Simple Cores , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).
[70] Sanjay J. Patel,et al. ReStore: symptom based soft error detection in microprocessors , 2005, 2005 International Conference on Dependable Systems and Networks (DSN'05).
[71] David I. August,et al. SWIFT: software implemented fault tolerance , 2005, International Symposium on Code Generation and Optimization.
[72] Edward J. McCluskey,et al. Error detection by duplicated instructions in super-scalar processors , 2002, IEEE Trans. Reliab..
[73] Edward J. McCluskey,et al. Control-flow checking by software signatures , 2002, IEEE Trans. Reliab..
[74] Michael N. Lovellette,et al. Strategies for fault-tolerant, space-based computing: Lessons learned from the ARGOS testbed , 2002, Proceedings, IEEE Aerospace Conference.
[75] Izzeldin Ibrahim Mohamed,et al. A Selective Mitigation Technique of Soft Errors for DNN Models Used in Healthcare Applications: DenseNet201 Case Study , 2021, IEEE Access.
[76] Abbreviazioni Periodici Giuridici. N. D. I. , 2013 .
[77] Alex Krizhevsky,et al. Learning Multiple Layers of Features from Tiny Images , 2009 .