Modeling soft errors for data caches and alleviating their effects on data reliability

Soft errors caused by strikes arising from energetic particles pose a significant reliability concern for computing systems. In this study, we first introduce a model for soft error occurrence and propagation in cache memories. Based on this model, we define a metric called Architectural Vulnerability Factor for Caches (AVFC), which represents the probability with which a fault in the cache can be visible in the final output of the program. We then propose three architectural schemes for improving reliability. Our first scheme prevents an error from propagating to the lower levels in the memory hierarchy by not forwarding the unmodified data words of dirty cache blocks to the L2 cache at write-backs. The second scheme selectively invalidates cache blocks to reduce their vulnerable periods. To reduce the performance overhead caused by block invalidation, our third scheme tries to bring a fresh copy of the invalidated block into the cache via prefetching. The experimental results for the SPEC2000 suite show that, based on the proposed model, our first and third schemes together can improve the data reliability roughly 96% at the cost of less than 1% overhead in execution time, quite more than data improvements achieved by either two well-known techniques, namely write-through and early write-back cache mechanisms.

[1]  Shuai Wang,et al.  On the Characterization of Data Cache Vulnerability in High-Performance Embedded Microprocessors , 2006, 2006 International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation.

[2]  Osman S. Unsal,et al.  Exploiting Narrow Values for Soft Error Tolerance , 2006, IEEE Computer Architecture Letters.

[3]  Mahmut T. Kandemir,et al.  Soft errors issues in low-power caches , 2005, IEEE Transactions on Very Large Scale Integration (VLSI) Systems.

[4]  Nhon Quach,et al.  High Availability and Reliability in the Itanium Processor , 2000, IEEE Micro.

[5]  Dhiraj K. Pradhan,et al.  Fault-tolerant computer system design , 1996 .

[6]  Carl Carmichael,et al.  Triple Module Redundancy Design Techniques for Virtex FPGAs, Application Note 197 , 2001 .

[7]  Sanjay J. Patel,et al.  Characterizing the effects of transient faults on a high-performance processor pipeline , 2004, International Conference on Dependable Systems and Networks, 2004.

[8]  T. Calin,et al.  Upset hardened memory design for submicron CMOS technology , 1996 .

[9]  Arijit Biswas,et al.  Computing architectural vulnerability factors for address-based structures , 2005, 32nd International Symposium on Computer Architecture (ISCA'05).

[10]  Irith Pomeranz,et al.  Transient-Fault Recovery for Chip Multiprocessors , 2003, IEEE Micro.

[11]  Joel S. Emer,et al.  Techniques to reduce the soft error rate of a high-performance microprocessor , 2004, Proceedings. 31st Annual International Symposium on Computer Architecture, 2004..

[12]  C.W. Slayman,et al.  Cache and memory error detection, correction, and reduction techniques for terrestrial servers and workstations , 2005, IEEE Transactions on Device and Materials Reliability.

[13]  K. Soumyanath,et al.  Scaling trends of cosmic ray induced soft errors in static latches beyond 0.18 /spl mu/ , 2001, 2001 Symposium on VLSI Circuits. Digest of Technical Papers (IEEE Cat. No.01CH37185).

[14]  Peter Hazucha,et al.  Characterization of soft errors caused by single event upsets in CMOS processes , 2004, IEEE Transactions on Dependable and Secure Computing.

[15]  Y. Yagil,et al.  A systematic approach to SER estimation and solutions , 2003, 2003 IEEE International Reliability Physics Symposium Proceedings, 2003. 41st Annual..

[17]  Lorenzo Alvisi,et al.  Modeling the effect of technology trends on the soft error rate of combinational logic , 2002, Proceedings International Conference on Dependable Systems and Networks.

[18]  Shubhendu S. Mukherjee,et al.  Transient fault detection via simultaneous multithreading , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).

[19]  Babak Falsafi,et al.  Mitigating multi-bit soft errors in L1 caches using last-store prediction , 2007 .

[20]  Xiaodong Li,et al.  SoftArch: an architecture-level tool for modeling and analyzing soft errors , 2005, 2005 International Conference on Dependable Systems and Networks (DSN'05).

[21]  Arun K. Somani,et al.  Soft error sensitivity characterization for microprocessor dependability enhancement strategy , 2002, Proceedings International Conference on Dependable Systems and Networks.

[22]  A. H. Johnston,et al.  Single-event upset in commercial silicon-on-insulator PowerPC microprocessors , 2002, 2002 IEEE International SOI Conference.

[23]  Margaret Martonosi,et al.  Cache decay: exploiting generational behavior to reduce cache leakage power , 2001, ISCA 2001.

[24]  Mahmut T. Kandemir,et al.  Modeling and improving data cache reliability: 1 , 2007, SIGMETRICS '07.

[25]  Babak Falsafi,et al.  Dual use of superscalar datapath for transient-fault detection and recovery , 2001, MICRO.

[26]  Joel Emer,et al.  A systematic methodology to compute the architectural vulnerability factors for a high-performance microprocessor , 2003, Proceedings. 36th Annual IEEE/ACM International Symposium on Microarchitecture, 2003. MICRO-36..

[27]  Robert S. Swarz,et al.  Reliable Computer Systems: Design and Evaluation , 1992 .

[28]  T. N. Vijaykumar,et al.  Opportunistic Transient-Fault Detection , 2006, IEEE Micro.

[29]  Chin-Long Chen,et al.  Error-Correcting Codes for Semiconductor Memory Applications: A State-of-the-Art Review , 1984, IBM J. Res. Dev..

[30]  Aneesh Aggarwal,et al.  Reducing resource redundancy for concurrent error detection techniques in high performance microprocessors , 2006, The Twelfth International Symposium on High-Performance Computer Architecture, 2006..

[31]  K. Osada,et al.  SRAM immunity to cosmic-ray-induced multierrors based on analysis of an induced parasitic bipolar effect , 2004, IEEE Journal of Solid-State Circuits.

[32]  Brad Calder,et al.  Basic block distribution analysis to find periodic behavior and simulation points in applications , 2001, Proceedings 2001 International Conference on Parallel Architectures and Compilation Techniques.

[33]  E. Cannon,et al.  SRAM SER in 90, 130 and 180 nm bulk and SOI technologies , 2004, 2004 IEEE International Reliability Physics Symposium. Proceedings.

[34]  G. Tyson,et al.  Eager writeback-a technique for improving bandwidth utilization , 2000, Proceedings 33rd Annual IEEE/ACM International Symposium on Microarchitecture. MICRO-33 2000.

[35]  Babak Falsafi,et al.  Multi-bit Error Tolerant Caches Using Two-Dimensional Error Coding , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[36]  Mehdi Baradaran Tahoori,et al.  Reducing Data Cache Susceptibility to Soft Errors , 2006, IEEE Transactions on Dependable and Secure Computing.

[37]  Kunle Olukotun,et al.  Niagara: a 32-way multithreaded Sparc processor , 2005, IEEE Micro.

[38]  M. Calvet,et al.  Simulation of nucleon-induced nuclear reactions in a simplified SRAM structure: scaling effects on SEU and MBU cross sections , 2001 .

[39]  James F. Ziegler,et al.  Terrestrial cosmic rays , 1996, IBM J. Res. Dev..

[40]  Joel S. Emer,et al.  The soft error problem: an architectural perspective , 2005, 11th International Symposium on High-Performance Computer Architecture.

[41]  Mahmut T. Kandemir,et al.  Soft error and energy consumption interactions: a data cache perspective , 2004, Proceedings of the 2004 International Symposium on Low Power Electronics and Design (IEEE Cat. No.04TH8758).