Optimizing Software-Directed Instruction Replication for GPU Error Detection
Stephen W. Keckler | Abdulrahman Mahmoud | Michael B. Sullivan | Timothy Tsai | Siva Kumar Sastry Hari
[1] Cristian Constantinescu, et al. Trends and Challenges in VLSI Circuit Reliability, 2003, IEEE Micro.
[2] Satoshi Matsuoka, et al. NVCR: A Transparent Checkpoint-Restart Library for NVIDIA CUDA, 2011, 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum.
[3] Scott A. Mahlke, et al. Runtime asynchronous fault tolerance via speculation, 2012, CGO '12.
[4] Amin Ansari, et al. Shoestring: probabilistic soft error reliability on the cheap, 2010, ASPLOS XV.
[5] Kevin Skadron, et al. A characterization of the Rodinia benchmark suite with comparison to contemporary CMP workloads, 2010, IEEE International Symposium on Workload Characterization (IISWC'10).
[6] Kevin Skadron, et al. Rodinia: A benchmark suite for heterogeneous computing, 2009, 2009 IEEE International Symposium on Workload Characterization (IISWC).
[7] Michael Sullivan, et al. CRUM: Checkpoint-Restart Support for CUDA's Unified Memory, 2018, 2018 IEEE International Conference on Cluster Computing (CLUSTER).
[8] Franck Cappello, et al. Addressing failures in exascale computing, 2014, Int. J. High Perform. Comput. Appl.
[9] Kevin Skadron, et al. Real-world design and evaluation of compiler-managed GPU redundant multithreading, 2014, 2014 ACM/IEEE 41st International Symposium on Computer Architecture (ISCA).
[10] Edward J. McCluskey, et al. Error detection by duplicated instructions in super-scalar processors, 2002, IEEE Trans. Reliab.
[11] Huiyang Zhou, et al. Understanding software approaches for GPGPU reliability, 2009, GPGPU-2.
[12] Satoshi Matsuoka, et al. MPI-CUDA Applications Checkpointing, 2010.
[13] Daniel J. Sorin, et al. Argus-G: Comprehensive, Low-Cost Error Detection for GPGPU Cores, 2015, IEEE Computer Architecture Letters.
[14] Mattan Erez, et al. Hamartia: A Fast and Accurate Error Injection Framework, 2018, 2018 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN-W).
[15] Joel S. Emer, et al. The soft error problem: an architectural perspective, 2005, 11th International Symposium on High-Performance Computer Architecture.
[16] J. Xu. OpenCL – The Open Standard for Parallel Programming of Heterogeneous Systems, 2009.
[17] Xin Fu, et al. RISE: Improving the streaming processors reliability against soft errors in GPGPUs, 2012, 2012 21st International Conference on Parallel Architectures and Compilation Techniques (PACT).
[18] David I. August, et al. SWIFT: software implemented fault tolerance, 2005, International Symposium on Code Generation and Optimization.
[19] Hyeran Jeon, et al. Warped-DMR: Light-weight Error Detection for GPGPU, 2012, 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture.
[20] Yi Yang, et al. Revisiting ILP Designs for Throughput-Oriented GPGPU Architecture, 2015, 2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing.
[21] Philip Koopman, et al. Selection of Cyclic Redundancy Code and Checksum Algorithms to Ensure Critical Data Integrity, 2015.
[22] Lisa Spainhower, et al. Commercial fault tolerance: a tale of two systems, 2004, IEEE Transactions on Dependable and Secure Computing.
[23] Fernanda Gusmão de Lima Kastensmidt, et al. Implementation and experimental evaluation of a CUDA core under single event effects, 2014, 2014 15th Latin American Test Workshop - LATW.
[24] Edward J. McCluskey, et al. Control-flow checking by software signatures, 2002, IEEE Trans. Reliab.
[25] Meeta Sharma Gupta, et al. Error Tolerance in Server Class Processors, 2011, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems.
[26] David W. Nellans, et al. Flexible software profiling of GPU architectures, 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).
[27] Tipp Moseley, et al. Using Process-Level Redundancy to Exploit Multiple Cores for Transient Fault Tolerance, 2007, 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN'07).
[28] Luigi Carro, et al. GPGPUs ECC efficiency and efficacy, 2014, 2014 IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFT).
[29] Hyeran Jeon, et al. Warped-RE: Low-Cost Error Detection and Correction in GPUs, 2015, 2015 45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks.
[30] David I. August, et al. Software-controlled fault tolerance, 2005, TACO.
[31] M. Rimen, et al. Implicit signature checking, 1995, Twenty-Fifth International Symposium on Fault-Tolerant Computing. Digest of Papers.
[32] Cheng Wang, et al. Compiler-Managed Software-based Redundant Multi-Threading for Transient Fault Detection, 2007, International Symposium on Code Generation and Optimization (CGO'07).
[33] Josep Torrellas, et al. InstantCheck: Checking the Determinism of Parallel Programs Using On-the-Fly Incremental Hashing, 2010, 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture.
[34] Rajesh K. Gupta, et al. Compiler techniques to reduce the synchronization overhead of GPU redundant multithreading, 2017, 2017 54th ACM/EDAC/IEEE Design Automation Conference (DAC).
[35] Shubhendu S. Mukherjee, et al. Transient fault detection via simultaneous multithreading, 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).
[36] Shidhartha Das, et al. A Triple Core Lock-Step (TCLS) ARM® Cortex®-R5 Processor for Safety-Critical and Ultra-Reliable Applications, 2016, 2016 46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshop (DSN-W).
[37] Hiroaki Kobayashi, et al. CheCUDA: A Checkpoint/Restart Tool for CUDA Applications, 2009, 2009 International Conference on Parallel and Distributed Computing, Applications and Technologies.
[38] Keshav Pingali, et al. A quantitative study of irregular programs on GPUs, 2012, 2012 IEEE International Symposium on Workload Characterization (IISWC).
[39] Stephen W. Keckler, et al. SASSIFI: An architecture-level fault injection tool for GPU application resilience evaluation, 2017, 2017 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS).
[40] Tipp Moseley, et al. PLR: A Software Approach to Transient Fault Tolerance for Multicore Architectures, 2009, IEEE Transactions on Dependable and Secure Computing.