Modeling and characterizing GPGPU reliability in the presence of soft errors

Abstract The general-purpose computing on graphic processing units (GPGPUs) becomes increasingly popular due to its high computational throughput for data parallel applications. Modern GPU architectures have limited capability for error detection and fault tolerance since they are originally designed for graphics processing. However, the rigorous execution correctness is required for general-purpose applications, which makes reliability a growing concern in the GPGPU architecture design. With CMOS processing technologies continuously scaling down to the nano-scale, on-chip soft error rate (SER) has been predicted to increase exponentially. GPGPUs with hundreds of cores integrated into a single chip are prone to manifest high SER. This paper explores a first step to model and characterize GPGPU reliability in light of soft errors. We develop GPGPU-SODA (GPGPU SOftware Dependability Analysis), a framework to estimate the soft-error vulnerability of GPGPU microarchitecture. By using GPGPU-SODA, we observe that several microarchitecture structures in GPGPUs exhibit high soft-error susceptibility, and the structure vulnerability is sensitive to the workload characteristics (e.g. branch divergences, memory access pattern). We further investigate the impact of several architectural optimizations on GPU soft-error robustness. For example, we find that increasing the number of threads supported by GPU significantly affects the GPGPU soft-error robustness. However, changing the warp scheduling policy has little impact on the structure vulnerability. The observations made in this study provide designers the useful guidance to build resilient GPGPUs: a comprehensive resiliency solution for GPGPUs should consider the entire GPGPU design instead of solely focusing on a particular structure.

[1]  Joel Emer,et al.  A systematic methodology to compute the architectural vulnerability factors for a high-performance microprocessor , 2003, Proceedings. 36th Annual IEEE/ACM International Symposium on Microarchitecture, 2003. MICRO-36..

[2]  Joel S. Emer,et al.  Techniques to reduce the soft error rate of a high-performance microprocessor , 2004, Proceedings. 31st Annual International Symposium on Computer Architecture, 2004..

[3]  Sanjay J. Patel,et al.  ReStore: Symptom-Based Soft Error Detection in Microprocessors , 2006, IEEE Trans. Dependable Secur. Comput..

[4]  Yao Zhang,et al.  A quantitative performance analysis model for GPU architectures , 2011, 2011 IEEE 17th International Symposium on High Performance Computer Architecture.

[5]  Kevin Skadron,et al.  Rodinia: A benchmark suite for heterogeneous computing , 2009, 2009 IEEE International Symposium on Workload Characterization (IISWC).

[6]  N. H. Kim Effect of Instruction Fetch and Memory Scheduling on GPU Performance , 2009 .

[7]  Onur Mutlu,et al.  Improving GPU performance via large warps and two-level warp scheduling , 2011, 2011 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[8]  Kevin Skadron,et al.  A hardware redundancy and recovery mechanism for reliable scientific computation on graphics processors , 2007, GH '07.

[9]  Anand Sivasubramaniam,et al.  Mechanisms for bounding vulnerabilities of processor structures , 2007, ISCA '07.

[10]  Andrew E. Turner,et al.  Visualizing complex dynamics in many-core accelerator architectures , 2010, 2010 IEEE International Symposium on Performance Analysis of Systems & Software (ISPASS).

[11]  Steven S. Muchnick,et al.  Advanced Compiler Design and Implementation , 1997 .

[12]  Henry Wong,et al.  Analyzing CUDA workloads using a detailed GPU simulator , 2009, 2009 IEEE International Symposium on Performance Analysis of Systems and Software.

[13]  Huiyang Zhou,et al.  Understanding software approaches for GPGPU reliability , 2009, GPGPU-2.

[14]  Tao Li,et al.  Exploring GPGPU workloads: Characterization methodology, analysis and microarchitecture evaluation implications , 2010, IEEE International Symposium on Workload Characterization (IISWC'10).

[15]  Arijit Biswas,et al.  Computing architectural vulnerability factors for address-based structures , 2005, 32nd International Symposium on Computer Architecture (ISCA'05).

[16]  Tor M. Aamodt,et al.  Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[17]  Daniel J. Sorin,et al.  Argus-G: A Low-Cost Error Detection Scheme for GPGPUs , 2010 .

[18]  Sudhakar Yalamanchili,et al.  A characterization and analysis of PTX kernels , 2009, 2009 IEEE International Symposium on Workload Characterization (IISWC).

[19]  Sanjay J. Patel,et al.  Performance characterization of a hardware mechanism for dynamic optimization , 2001, Proceedings. 34th ACM/IEEE International Symposium on Microarchitecture. MICRO-34.

[20]  J. Fortes,et al.  Sim-SODA : A Unified Framework for Architectural Level Software Reliability Analysis , 2006 .

[21]  Hyesoon Kim,et al.  An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness , 2009, ISCA '09.

[22]  Hyesoon Kim,et al.  An integrated GPU power and performance model , 2010, ISCA.

[23]  William Gropp,et al.  An adaptive performance modeling tool for GPU architectures , 2010, PPoPP '10.