Replication cache: a small fully associative cache to improve data cache reliability

Soft-error-conscious cache design has become increasingly crucial for reliable computing. The widely used ECC and parity-based integrity checks offer only limited error detection and correction capability while incurring a nontrivial area or performance penalty, and the N-modular redundancy (NMR) scheme is too costly for processors with stringent cost constraints. This paper proposes a cost-effective solution that significantly enhances data reliability with minimal impact on performance. The idea is to add a small fully associative cache that stores a replica of every write to the L1 data cache. Because of data locality and full associativity, the replication cache can be kept small while still providing replicas for a large fraction of L1 read hits, and these replicas can be used to protect data integrity against soft errors. Our experiments show that a replication cache with eight blocks provides replicas for 97.3 percent of L1 read hits on average. Moreover, compared with the recently proposed in-cache replication schemes, the replication cache is more energy efficient while significantly improving data integrity against soft errors.
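The mechanism described above can be illustrated with a minimal behavioral sketch. It is not the paper's implementation: the class name `ReplicationCache`, the LRU replacement policy, the 32-byte block size, tracking replicas only for written words rather than whole blocks, and the three-way result of `check_read` are all assumptions made for illustration. The abstract specifies only that the structure is small, fully associative, and receives a replica of every L1 write.

```python
from collections import OrderedDict


class ReplicationCache:
    """Behavioral model of a small fully associative replica buffer.

    Assumptions (not from the paper): LRU replacement, 32-byte blocks,
    and replicas kept only for the words that were actually written.
    """

    def __init__(self, num_blocks=8, block_size=32):
        self.num_blocks = num_blocks
        self.block_size = block_size
        # Maps block address -> {offset: value}; ordered from LRU to MRU.
        self.entries = OrderedDict()

    def on_write(self, addr, value):
        """Record a replica of a write that is also sent to the L1 data cache."""
        blk, off = divmod(addr, self.block_size)
        if blk in self.entries:
            self.entries.move_to_end(blk)          # refresh LRU position
        else:
            if len(self.entries) >= self.num_blocks:
                self.entries.popitem(last=False)   # evict the LRU block
            self.entries[blk] = {}
        self.entries[blk][off] = value

    def check_read(self, addr, l1_value):
        """Compare the value read from L1 with the replica, if one exists.

        Returns "no_replica", "match", or "mismatch".
        """
        blk, off = divmod(addr, self.block_size)
        words = self.entries.get(blk)
        if words is None or off not in words:
            return "no_replica"
        self.entries.move_to_end(blk)              # assumed: reads also refresh LRU
        return "match" if words[off] == l1_value else "mismatch"


if __name__ == "__main__":
    rc = ReplicationCache(num_blocks=8)
    rc.on_write(0x100, 42)
    print(rc.check_read(0x100, 42))   # value in L1 agrees with the replica
    print(rc.check_read(0x100, 43))   # a corrupted L1 value disagrees
    print(rc.check_read(0x900, 7))    # never written, so no replica is held
```

In this sketch a "mismatch" result stands in for a detected soft error. What the hardware does next (for example, recovering from the replica) is outside what the abstract states and is left out here.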
