Workload-dependent relative fault sensitivity and error contribution factor of GPU onchip memory structures

The GPU (Graphics Processing Unit) is emerging as an efficient and scalable accelerator for data-parallel workloads on platforms ranging from tablet PCs to HPC (High Performance Computing) mainframes. Unlike traditional 3D graphics rendering, general-purpose compute applications demand stringent reliability guarantees. Single-error tolerance schemes such as SECDED (Single Error Correcting, Double Error Detecting) codes are therefore being rapidly introduced into high-end GPUs that target high-performance general-purpose computing. However, the relative fault sensitivity and error contribution of critical on-chip memory structures, namely the active mask stack (AMS), register file (REG) and local memory (MEM), have yet to be studied. Likewise, the implications of single-error tolerance for GPGPU (General Purpose computing on GPU) workloads have not been analyzed quantitatively to reveal its relative cost and fault-tolerance efficiency. To address these gaps, this work explores a novel Monte Carlo simulation framework that collects and analyzes well-converged fault injection data. Instead of estimating the AVF (Architectural Vulnerability Factor) of each structure individually, we inject faults into the whole memory (AMS, REG and MEM combined) in a structure-oblivious fashion. We then categorize the results by structure and analyze each structure's relative fault sensitivity and error contribution factor. Finally, we study the implications of single-error tolerance for these memory structures by considering eight possible ECC (Error Correction Code) profiles. The results show that the relative fault sensitivity and error contribution of REG are the highest among the considered memory structures; ECC protection of REG is therefore the most critical and cost-effective.
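
The structure-oblivious injection and the per-structure breakdown described above can be illustrated with a minimal Python sketch. It is not the authors' framework, and several things in it are assumptions made for illustration only: the structure sizes, the per-structure error probabilities (a stand-in for actually running a workload in a simulator and comparing its output with a fault-free run), the reading of "fault sensitivity" as errors per injected fault, the reading of "error contribution" as a structure's share of all observed errors, and the reading of the eight ECC profiles as every protect/do-not-protect combination of the three structures.

```python
import random
from itertools import product

# Hypothetical structure sizes in bits (placeholders, not from the paper).
SIZES = {"AMS": 1_024, "REG": 65_536, "MEM": 32_768}
# Hypothetical chance that a flipped bit corrupts the output. A real study
# would obtain the outcome by simulating the workload, not from a constant.
P_ERROR = {"AMS": 0.30, "REG": 0.20, "MEM": 0.05}


def locate(bit, sizes):
    """Map a flat bit index in the combined memory to its structure."""
    for name, size in sizes.items():
        if bit < size:
            return name
        bit -= size
    raise ValueError("bit index out of range")


def run_campaign(trials, sizes=SIZES, p_error=P_ERROR, seed=0):
    """Inject one single-bit fault per trial, uniformly over all bits."""
    rng = random.Random(seed)
    total_bits = sum(sizes.values())
    faults = dict.fromkeys(sizes, 0)
    errors = dict.fromkeys(sizes, 0)
    for _ in range(trials):
        # Structure-oblivious: the target bit is chosen without regard to
        # which structure it belongs to; the structure is recorded afterwards.
        s = locate(rng.randrange(total_bits), sizes)
        faults[s] += 1
        if rng.random() < p_error[s]:
            errors[s] += 1
    return faults, errors


def summarize(faults, errors):
    """Assumed metrics: errors per fault, and share of all errors."""
    total = sum(errors.values())
    sensitivity = {s: errors[s] / faults[s] if faults[s] else 0.0 for s in faults}
    contribution = {s: errors[s] / total if total else 0.0 for s in errors}
    return sensitivity, contribution


def ecc_profiles(errors):
    """Residual error share for each of the 2**3 protection choices,
    assuming ECC corrects every single-bit fault in a protected structure."""
    names = list(errors)
    total = sum(errors.values())
    result = {}
    for mask in product([False, True], repeat=len(names)):
        protected = tuple(n for n, m in zip(names, mask) if m)
        residual = sum(errors[n] for n in names if n not in protected)
        result[protected] = residual / total if total else 0.0
    return result


if __name__ == "__main__":
    faults, errors = run_campaign(100_000)
    print(summarize(faults, errors))
    print(ecc_profiles(errors))
```

Because faults are drawn uniformly over all bits, a larger structure receives proportionally more injections, so under these assumed definitions a structure's error contribution reflects both its size and its per-fault sensitivity. The numbers the sketch prints follow entirely from the placeholder constants and say nothing about the paper's measured results.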
