SwapCodes: Error Codes for Hardware-Software Cooperative GPU Pipeline Error Detection
Stephen W. Keckler | Brian Zimmer | Michael B. Sullivan | Timothy Tsai | Siva Kumar Sastry Hari
[1] Scott A. Mahlke,et al. Runtime asynchronous fault tolerance via speculation , 2012, CGO '12.
[2] Daniel J. Sorin,et al. Argus-G: Comprehensive, Low-Cost Error Detection for GPGPU Cores , 2015, IEEE Computer Architecture Letters.
[3] Jacob A. Abraham,et al. Quantitative evaluation of soft error injection techniques for robust system design , 2013, 2013 50th ACM/EDAC/IEEE Design Automation Conference (DAC).
[4] James M. Caffrey. The resiliency challenge presented by soft failure incidents , 2008, IBM Syst. J..
[5] Bo Fang. Error Resilience Evaluation on GPGPU Applications , 2014 .
[6] T. R. N. Rao. Error-Checking Logic for Arithmetic-Type Operations of a Processor , 1968, IEEE Transactions on Computers.
[7] I. L. Sayers,et al. Implementation of 32-bit RISC processor incorporating hardware concurrent error detection and correction , 1990 .
[8] Mattan Erez,et al. Hamartia: A Fast and Accurate Error Injection Framework , 2018, 2018 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN-W).
[9] Algirdas Avizienis,et al. Arithmetic Error Codes: Cost and Effectiveness Studies for Application in Digital System Design , 1971, IEEE Transactions on Computers.
[10] John F. Wakerly,et al. Error detecting codes, self-checking circuits and applications , 1978 .
[11] Kevin Skadron,et al. Rodinia: A benchmark suite for heterogeneous computing , 2009, 2009 IEEE International Symposium on Workload Characterization (IISWC).
[12] Thiago Santini,et al. Evaluation and Mitigation of Radiation-Induced Soft Errors in Graphics Processing Units , 2016, IEEE Transactions on Computers.
[13] Jien-Chung Lo. Reliable Floating-Point Arithmetic Algorithms for Error-Coded Operands , 1994, IEEE Trans. Computers.
[14] Cheng Wang,et al. Compiler-Managed Software-based Redundant Multi-Threading for Transient Fault Detection , 2007, International Symposium on Code Generation and Optimization (CGO'07).
[15] E. G. Chester,et al. Design of a reliable and self-testing VLSI datapath using residue coding techniques , 1986 .
[16] W. F. Heida,et al. Towards a fault tolerant RISC-V softcore , 2016 .
[17] Tipp Moseley,et al. PLR: A Software Approach to Transient Fault Tolerance for Multicore Architectures , 2009, IEEE Transactions on Dependable and Secure Computing.
[18] Edward J. McCluskey,et al. Error detection by duplicated instructions in super-scalar processors , 2002, IEEE Trans. Reliab..
[19] Luigi Carro,et al. GPGPUs ECC efficiency and efficacy , 2014, 2014 IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFT).
[20] Jeffrey P. Kubala,et al. IBM System z10 design for RAS , 2009, IBM J. Res. Dev..
[21] Huiyang Zhou,et al. Understanding software approaches for GPGPU reliability , 2009, GPGPU-2.
[22] Franck Cappello,et al. Addressing failures in exascale computing , 2014, Int. J. High Perform. Comput. Appl..
[23] Albert Meixner,et al. Argus: Low-Cost, Comprehensive Error Detection in Simple Cores , 2008, IEEE Micro.
[24] Rajesh K. Gupta,et al. Compiler techniques to reduce the synchronization overhead of GPU redundant multithreading , 2017, 2017 54th ACM/EDAC/IEEE Design Automation Conference (DAC).
[25] Algirdas Avizienis,et al. The STAR (Self-Testing And Repairing) Computer: An Investigation of the Theory and Practice of Fault-Tolerant Computer Design , 1971, IEEE Transactions on Computers.
[26] Reto Zimmermann,et al. Efficient VLSI implementation of modulo (2^n ± 1) addition and multiplication , 1999, Proceedings 14th IEEE Symposium on Computer Arithmetic (Cat. No.99CB36336).
[27] M. Y. Hsiao,et al. Reliability, Availability, and Serviceability of IBM Computer Systems: A Quarter Century of Progress , 1981, IBM J. Res. Dev..
[28] William J. Dally,et al. Unifying Primary Cache, Scratch, and Register File Memories in a Throughput Processor , 2012, 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture.
[29] Behrooz Parhami,et al. Computer arithmetic - algorithms and hardware designs , 1999 .
[30] David W. Nellans,et al. Flexible software profiling of GPU architectures , 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).
[31] Luigi Carro,et al. On the evaluation of soft-errors detection techniques for GPGPUs , 2013, 2013 8th IEEE Design and Test Symposium.
[32] Eiji Fujiwara,et al. Code design for dependable systems : theory and practical applications , 2006 .
[33] Mohd Hafiz Sulaiman,et al. A survey of fault-tolerant processor based on error correction code , 2014, 2014 IEEE Student Conference on Research and Development.
[34] Stephen W. Keckler,et al. SASSIFI: An architecture-level fault injection tool for GPU application resilience evaluation , 2017, 2017 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS).
[35] Stanislaw J. Piestrak. Design of Residue Generators and Multioperand Modular Adders Using Carry-Save Adders , 1994, IEEE Trans. Computers.
[36] Dimitris Gizopoulos,et al. MeRLiN: Exploiting dynamic instruction behavior for fast and accurate microarchitecture level reliability assessment , 2017, 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA).
[37] M. Y. Hsiao,et al. A class of optimal minimum odd-weight-column SEC-DED codes , 1970 .
[38] Jinsuk Chung,et al. Containment domains: a scalable, efficient, and flexible resilience scheme for exascale systems , 2012, HiPC 2012.
[39] David I. August,et al. SWIFT: software implemented fault tolerance , 2005, International Symposium on Code Generation and Optimization.
[40] Michael Nicolaidis,et al. Carry checking/parity prediction adders and ALUs , 2003, IEEE Trans. Very Large Scale Integr. Syst..
[41] Eric Schwarz,et al. Self Checking in Current Floating-Point Units , 2011, 2011 IEEE 20th Symposium on Computer Arithmetic.
[42] Hossam A. H. Fahmy,et al. Residue codes for error correction in a combined decimal/binary redundant floating point adder , 2012, 2012 Conference Record of the Forty Sixth Asilomar Conference on Signals, Systems and Computers (ASILOMAR).
[43] Kevin Skadron,et al. Real-world design and evaluation of compiler-managed GPU redundant multithreading , 2014, 2014 ACM/IEEE 41st International Symposium on Computer Architecture (ISCA).
[44] Eric Cheng,et al. System-Level Effects of Soft Errors in Uncore Components , 2017, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems.
[45] Stephen W. Keckler,et al. Optimizing Software-Directed Instruction Replication for GPU Error Detection , 2018, SC18: International Conference for High Performance Computing, Networking, Storage and Analysis.