CiDRA: A cache-inspired DRAM resilience architecture

Although aggressive technology scaling has allowed manufacturers to integrate gigabits of cells into a cost-sensitive main-memory DRAM device, these cells have become more defect-prone. With increased cell failure rates, conventional solutions such as populating spare DRAM rows and relying on error-correcting codes (ECCs) have shown limited success due to high area overhead, the latency penalties of data coding, and interference between ECC within a device (in-DRAM ECC) and ECC across devices (rank-level ECC). In this paper, we propose CiDRA, a cache-inspired DRAM resilience architecture, which substantially reduces the area and latency overheads of correcting bit errors at random locations caused by these faulty cells. We place a small SRAM cache within a DRAM device so that accesses to addresses containing faulty cells are served from the cache data array instead. This CiDRA cache is paired with a Bloom filter to minimize the energy overhead of accessing the cache tags on every DRAM access, and it is partitioned into small pieces, each associated with the I/O pads, for better area efficiency. The cache and the DRAM banks are accessed in parallel, and the banks are much slower; consequently, the cache and filter are not on the critical path of normal DRAM accesses and incur no latency overhead. We also enhance the traditional in-DRAM ECC with error position bits and the appropriate error-detecting capability while preventing interference with the traditional rank-level ECC scheme. By combining this enhanced in-DRAM ECC with the cache and Bloom filter, CiDRA becomes more area-efficient because the in-DRAM ECC corrects most bit errors that are sporadic while the cache handles the relatively few remaining pathological cases.
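The interplay between the Bloom filter and the cache can be illustrated with a minimal software sketch. It is not the hardware design from the paper: the class names (`BloomFilter`, `FaultRemapCache`), the filter size, the SHA-256 based hashing, and the dictionary standing in for the DRAM banks are all assumptions made for illustration. The sketch only shows the access flow implied by the abstract, in which the filter is consulted first and the cache tags are probed only when the filter reports a possible match.

```python
import hashlib


class BloomFilter:
    """Hypothetical Bloom filter over addresses (sizes are illustrative)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, addr):
        # Derive num_hashes bit positions by prefixing a per-hash index byte.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + addr.to_bytes(8, "little")).digest()
            yield int.from_bytes(digest[:8], "little") % self.num_bits

    def add(self, addr):
        for pos in self._positions(addr):
            self.bits[pos] = 1

    def might_contain(self, addr):
        # False means "definitely absent"; True may be a false positive.
        return all(self.bits[pos] for pos in self._positions(addr))


class FaultRemapCache:
    """Hypothetical cache that serves addresses known to contain faulty cells."""

    def __init__(self, faulty_addrs):
        self.filter = BloomFilter()
        self.data = {}          # tag -> data; stands in for the SRAM array
        self.tag_lookups = 0    # counts accesses that had to probe the tags
        for addr in faulty_addrs:
            self.filter.add(addr)
            self.data[addr] = 0

    def read(self, addr, dram):
        # In hardware the bank access proceeds in parallel with the filter.
        dram_value = dram.get(addr, 0)
        if not self.filter.might_contain(addr):
            return dram_value           # tags are never touched
        self.tag_lookups += 1
        if addr in self.data:
            return self.data[addr]      # faulty location: use cached copy
        return dram_value               # Bloom false positive

    def write(self, addr, value, dram):
        if self.filter.might_contain(addr):
            self.tag_lookups += 1
            if addr in self.data:
                self.data[addr] = value
                return
        dram[addr] = value


dram = {}                               # stand-in for the DRAM banks
cache = FaultRemapCache(faulty_addrs=[0x10, 0x2A])
cache.write(0x10, 111, dram)            # faulty address, kept in the cache
cache.write(0x99, 222, dram)            # healthy address, goes to DRAM
print(cache.read(0x10, dram), cache.read(0x99, dram))
```

Because a Bloom filter has no false negatives, every faulty address reaches the cache tags; a false positive only costs one extra tag probe and still returns the DRAM data.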
