Balancing Performance and Reliability in the Memory Hierarchy

Cosmic-ray induced soft errors in cache memories are becoming a major threat to the reliability of microprocessor-based systems. In this paper, we present a new method to accurately estimate the reliability of cache memories. We have measured the MTTF (mean-time-to-failure) of unprotected first-level (L1) caches for twenty programs taken from SPEC2000 benchmark suite. Our results show that a 16 KB first-level cache possesses a MTTF of at least 400 years (for a raw error rate of 0.002 FIT/bit.) However, this MTTF is significantly reduced for higher error rates and larger cache sizes. Our results show that for selected programs, a 64 KB first-level cache is more than 10 times as vulnerable to soft errors versus a 16 KB cache memory. Our work also illustrates that the reliability of cache memories is highly application-dependent. Finally, we present three different techniques to reduce the susceptibility of first-level caches to soft errors by two orders of magnitude. Our analysis shows how to achieve a balance between performance and reliability

[1]  Frank Vahid,et al.  A self-tuning cache architecture for embedded systems , 2004, Proceedings Design, Automation and Test in Europe Conference and Exhibition.

[2]  Arun K. Somani,et al.  Area efficient architectures for information integrity in cache memories , 1999, ISCA.

[3]  Michael L. Scott,et al.  Integrating adaptive on-chip storage structures for reduced dynamic power , 2002, Proceedings.International Conference on Parallel Architectures and Compilation Techniques.

[4]  Gwan S. Choi,et al.  On-chip cache memory resilience , 1998, Proceedings Third IEEE International High-Assurance Systems Engineering Symposium (Cat. No.98EX231).

[5]  Y. Yagil,et al.  A systematic approach to SER estimation and solutions , 2003, 2003 IEEE International Reliability Physics Symposium Proceedings, 2003. 41st Annual..

[6]  Johan Karlsson,et al.  On latching probability of particle induced transients in combinational networks , 1994, Proceedings of IEEE 24th International Symposium on Fault- Tolerant Computing.

[7]  K. Kavi Cache Memories Cache Memories in Uniprocessors. Reading versus Writing. Improving Performance , 2022 .

[8]  Rajeev Balasubramonian,et al.  Memory hierarchy reconfiguration for energy and performance in general-purpose processor architectures , 2000, MICRO 33.

[9]  Frank Vahid,et al.  A highly configurable cache architecture for embedded systems , 2003, 30th Annual International Symposium on Computer Architecture, 2003. Proceedings..

[10]  Brad Calder,et al.  Picking statistically valid and early simulation points , 2003, 2003 12th International Conference on Parallel Architectures and Compilation Techniques.

[11]  Johan Karlsson,et al.  Using heavy-ion radiation to validate fault-handling mechanisms , 1994, IEEE Micro.

[12]  Richard E. Kessler,et al.  The Alpha 21264 microprocessor , 1999, IEEE Micro.

[13]  Jiri Gaisler Evaluation of a 32-bit microprocessor with built-in concurrent error-detection , 1997, Proceedings of IEEE 27th International Symposium on Fault Tolerant Computing.

[14]  Alan Jay Smith,et al.  Aspects of cache memory and instruction buffer performance , 1987 .

[15]  Tryggve Fossum,et al.  Cache scrubbing in microprocessors: myth or necessity? , 2004, 10th IEEE Pacific Rim International Symposium on Dependable Computing, 2004. Proceedings..

[16]  Massimo Violante,et al.  An accurate analysis of the effects of soft errors in the instruction and data caches of a pipelined microprocessor , 2003, 2003 Design, Automation and Test in Europe Conference and Exhibition.

[17]  E. Normand Single event upset at ground level , 1996 .

[18]  Lorenzo Alvisi,et al.  Modeling the effect of technology trends on the soft error rate of combinational logic , 2002, Proceedings International Conference on Dependable Systems and Networks.

[19]  Changhong Dai,et al.  Impact of CMOS process scaling and SOI on the soft error rates of logic processes , 2001, 2001 Symposium on VLSI Technology. Digest of Technical Papers (IEEE Cat. No.01 CH37184).

[20]  Lloyd W. Massengill,et al.  Impact of scaling on soft-error rates in commercial microprocessors , 2002 .

[21]  Barry W. Johnson Design & analysis of fault tolerant digital systems , 1988 .

[22]  Arun K. Somani,et al.  Soft error sensitivity characterization for microprocessor dependability enhancement strategy , 2002, Proceedings International Conference on Dependable Systems and Networks.

[23]  Janak H. Patel,et al.  Reliability of scrubbing recovery-techniques for memory systems , 1990 .

[24]  Todd M. Austin,et al.  A Systematic Methodology to Compute the Architectural Vulnerability Factors for a High-Performance Microprocessor , 2003, MICRO.

[25]  R. Velazco,et al.  Impact of data cache memory on the single event upset-induced error rate of microprocessors , 2003 .

[26]  K. Soumyanath,et al.  Scaling trends of cosmic ray induced soft errors in static latches beyond 0.18 /spl mu/ , 2001, 2001 Symposium on VLSI Circuits. Digest of Technical Papers (IEEE Cat. No.01CH37185).

[27]  Wei Zhang,et al.  ICR: in-cache replication for enhancing data cache reliability , 2003, 2003 International Conference on Dependable Systems and Networks, 2003. Proceedings..

[28]  Mikko H. Lipasti,et al.  Silent stores for free , 2000, MICRO 33.

[29]  Todd M. Austin,et al.  The SimpleScalar tool set, version 2.0 , 1997, CARN.