An adaptive write error detection technique in on-chip caches of multi-level caching systems

Abstract Cache memories are becoming an integral part of modern computer systems and are instrumented in various ways. Because of reference locality, the CPU mostly exchanges instructions and data with the first-level on-chip caches. That content is originally fetched from the secondary cache or memory, and such fetches occur with very low frequency. Guaranteeing this initial fetch-and-write into the first-level cache, which is rare but fundamental for correct future operation, is therefore indispensable for a dependable caching system. This paper presents a new cache write error detection scheme, called cache write sure (CWS), which exploits the preexisting information redundancy of multi-level caching systems. The effectiveness of this detection technique is evaluated using on-the-fly trace-driven simulations of thirteen benchmarks combined with software error injection. The results show that, for most workloads, CWS provides almost complete write error detection for a non-protected I-cache in a two-level on-chip caching system with a cache cycle time ratio between L1 and L2 of 1:5. Under the same conditions, it also covers 57.9% of write errors for the D-cache.
