Flexible ECC Management for Low-Cost Transient Error Protection of Last-Level Caches

Conventional error-correcting code (ECC) schemes for caches rely on a fixed mapping between cache data words and ECC check bits, and on a fixed ECC word granularity. This uses the ECC check bits inefficiently. We propose to manage the check bits flexibly for low-cost error protection of last-level caches. The proposed ECC schemes work at the word level, whereas conventional ECC schemes work at the cache line or set level. The proposed schemes protect only dirty (modified) words with ECC check bits, using a flexible mapping. Moreover, they use variable ECC word granularities: dirty words that are unlikely to be modified again before eviction are collectively protected with a larger ECC word granularity. The proposed schemes reduce DRAM and data bus energy overheads by 28% and 45%, respectively, at the same area overhead as previously proposed competitive schemes. The energy reductions are larger for multicore systems, with no noticeable performance degradation.
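As a rough illustration of why a larger ECC word granularity saves check bits, the sketch below counts the check bits of a standard Hamming-based SEC-DED code at several granularities. The choice of SEC-DED, the 64-byte line size, and all function names are assumptions made for this example; the abstract does not state which code or line size the proposed schemes use.

```python
def secded_check_bits(data_bits: int) -> int:
    """Check bits for a Hamming SEC code plus one overall parity bit (SEC-DED)."""
    r = 0
    # Smallest r such that 2^r >= data_bits + r + 1 (single-error correction).
    while (1 << r) < data_bits + r + 1:
        r += 1
    return r + 1  # one extra parity bit for double-error detection


LINE_BITS = 512  # assumed 64-byte cache line (hypothetical, for illustration)


def line_overhead(granularity_bits: int) -> int:
    """Total check bits needed to cover one full line at a given ECC word size."""
    ecc_words = LINE_BITS // granularity_bits
    return ecc_words * secded_check_bits(granularity_bits)


if __name__ == "__main__":
    for g in (64, 128, 256, 512):
        print(f"{g:4d}-bit ECC word: {secded_check_bits(g):2d} check bits, "
              f"{line_overhead(g):2d} per line")
```

Under these assumptions, covering a line with eight 64-bit ECC words costs 64 check bits, while a single 512-bit ECC word costs 11. The trade-off is that updating any part of a large ECC word requires recomputing its check bits over the whole word, which is consistent with reserving the larger granularity for dirty words that are unlikely to be written again.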
