ICR: in-cache replication for enhancing data cache reliability

Processor caches already play a critical role in the performance of today’s computer systems. At the same time, the data integrity of words coming out of the caches can have serious consequences on the ability of a program to execute correctly, or even to proceed. The integrity checks need to be performed in a time-sensitive manner to not slow down the execution when there are no errors as in the common case, and should not excessively increase the power budget of the caches which is already high. ECC and parity-based protection techniques in use today fall at either extremes in terms of compromising one criteria for another, i.e., reliability for performance or vice-versa. This paper proposes a novel solution to this problem by allowing in-cache replication, wherein reliability can be enhanced without excessively slowing down cache accesses or requiring significant area cost increases. The mechanism is fairly power efficient in comparison to other alternatives as well. In particular, the solution replicates data that is in active use within the cache itself while evicting those that may not be needed in the near future. Our experiments show that a large fraction of the data read from the cache have replicas available with this optimization.

[1]  Luca Benini,et al.  Low power error resilient encoding for on-chip data buses , 2002, Proceedings 2002 Design, Automation and Test in Europe Conference and Exhibition.

[2]  David A. Patterson,et al.  Computer Architecture: A Quantitative Approach , 1969 .

[3]  Arun K. Somani,et al.  An adaptive write error detection technique in on-chip caches of multi-level caching systems , 1999, Microprocess. Microsystems.

[4]  Robert S. Swarz,et al.  Reliable Computer Systems: Design and Evaluation , 1992 .

[5]  Kishor S. Trivedi,et al.  Performance And Reliability Analysis Of Computer Systems (an Example-based Approach Using The Sharpe Software , 1997, IEEE Transactions on Reliability.

[6]  Janusz Sosnowski,et al.  Transient fault tolerance in digital systems , 1994, IEEE Micro.

[7]  Bella Bose,et al.  Burst asymmetric/unidirectional error correcting/detecting codes , 1990, [1990] Digest of Papers. Fault-Tolerant Computing: 20th International Symposium.

[8]  Hideki Imai Essentials of Error-Control Coding Techniques , 1990 .

[9]  Michael J. Flynn,et al.  An area model for on-chip memories and its application , 1991 .

[10]  Kevin Reick,et al.  Power4 System Design for High Reliability , 2002, IEEE Micro.

[11]  Boudewijn R. Haverkort,et al.  Performance and reliability analysis of computer systems: An example-based approach using the sharpe software package , 1998 .

[12]  J. C. Pickel,et al.  Cosmic Ray Induced in MOS Memory Cells , 1978, IEEE Transactions on Nuclear Science.

[13]  Arun K. Somani,et al.  Area efficient architectures for information integrity in cache memories , 1999, ISCA.

[14]  Kishor S. Trivedi,et al.  A cache error propagation model , 1997, Proceedings Pacific Rim International Symposium on Fault-Tolerant Systems.

[15]  Arun K. Somani,et al.  An Affordable Transient Fault Tolerance for Superscalar Processors , 2001 .

[16]  Babak Falsafi,et al.  Dead-block prediction & dead-block correlating prefetchers , 2001, ISCA 2001.

[17]  Kevin Skadron,et al.  Design issues and tradeoffs for write buffers , 1997, Proceedings Third International Symposium on High-Performance Computer Architecture.

[18]  Janak H. Patel,et al.  Error Recovery in Shared Memory Multiprocessors Using Private Caches , 1990, IEEE Trans. Parallel Distributed Syst..

[19]  Todd M. Austin,et al.  The SimpleScalar tool set, version 2.0 , 1997, CARN.

[20]  Johan Karlsson,et al.  Using heavy-ion radiation to validate fault-handling mechanisms , 1994, IEEE Micro.

[21]  Janak H. Patel,et al.  Reliability of scrubbing recovery-techniques for memory systems , 1990 .

[22]  K. Yelick,et al.  Intelligent RAM (IRAM): chips that remember and compute , 1997, 1997 IEEE International Solids-State Circuits Conference. Digest of Technical Papers.

[23]  Eiji Fujiwara,et al.  A Class of Error Control Codes for Byte Organized Memory Systems -SbEC-(Sb+S)ED Codes- , 1997, IEEE Trans. Computers.

[24]  Margaret Martonosi,et al.  Cache decay: exploiting generational behavior to reduce cache leakage power , 2001, ISCA 2001.

[25]  M. Martonosi,et al.  Timekeeping in the memory system: predicting and optimizing memory behavior , 2002, Proceedings 29th Annual International Symposium on Computer Architecture.

[26]  Jih-Kwon Peir,et al.  Capturing dynamic memory reference behavior with adaptive cache topology , 1998, ASPLOS VIII.

[27]  Lorenzo Alvisi,et al.  Modeling the effect of technology trends on the soft error rate of combinational logic , 2002, Proceedings International Conference on Dependable Systems and Networks.