Defect Analysis and Cost-Effective Resilience Architecture for Future DRAM Devices

Technology scaling has continuously improved the density, performance, energy efficiency, and cost of DRAM-based main memory systems. Starting from sub-20nm processes, however, the industry began to pay considerably higher costs to screen and manage notably increasing defective cells. The traditional technique, which replaces the rows/columns containing faulty cells with spare rows/columns, has been able to cost-effectively repair the defective cells so far, but it will become unaffordable soon because an excessive number of spare rows/columns are required to manage the increasing number of defective cells. This necessitates a synergistic application of an alternative resilience technique such as In-DRAM ECC with the traditional one. Through extensive measurement and simulation, we first identify that aggressive miniaturization makes DRAM cells more sensitive to random telegraph noise or variable retention time, which is dominantly manifested as a surge in randomly scattered single-cell faults. Second, we advocate using In-DRAM ECC to overcome the DRAM scaling challenges and architect In-DRAM ECC to accomplish high area efficiency and minimal performance degradation. Moreover, we show that advancement in process technology reduces decoding/correction time to a small fraction of DRAM access time, and that the throughput penalty of a write operation due to an additional read for a parity update is mostly overcome by the multi-bank structure and long burst writes that span an entire In-DRAM ECC codeword. Lastly, we demonstrate that system reliability with modern rank-level ECC schemes such as single device data correction is further improved by hundred million times with the proposed In-DRAM ECC architecture.

[1]  Eduardo Pinheiro,et al.  DRAM errors in the wild: a large-scale field study , 2009, SIGMETRICS '09.

[2]  Shekhar Y. Borkar,et al.  Designing reliable systems from unreliable components: the challenges of transistor variability and degradation , 2005, IEEE Micro.

[3]  John L. Henning SPEC CPU2006 memory footprint , 2007, CARN.

[4]  Sudhanva Gurumurthi,et al.  Feng Shui of supercomputer memory positional effects in DRAM and SRAM faults , 2013, 2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC).

[5]  John Sartori,et al.  High Performance, Energy Efficient Chipkill Correct Memory with Multidimensional Parity , 2013, IEEE Computer Architecture Letters.

[6]  Anoop Gupta,et al.  The SPLASH-2 programs: characterization and methodological considerations , 1995, ISCA.

[7]  Onur Mutlu,et al.  An experimental study of data retention behavior in modern DRAM devices: implications for retention time profiling mechanisms , 2013, ISCA.

[8]  Vilas Sridharan,et al.  A study of DRAM failures in the field , 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.

[9]  Bruce Jacob,et al.  Memory Systems: Cache, DRAM, Disk , 2007 .

[10]  M. Y. Hsiao,et al.  A class of optimal minimum odd-weight-column SEC-DED codes , 1970 .

[11]  D. Yaney,et al.  A meta-stable leakage phenomenon in DRAM charge storage —Variable hold time , 1987, 1987 International Electron Devices Meeting.

[12]  Onur Mutlu,et al.  Parallelism-Aware Batch Scheduling: Enhancing both Performance and Fairness of Shared DRAM Systems , 2008, 2008 International Symposium on Computer Architecture.

[13]  Brad Calder,et al.  Automatically characterizing large scale program behavior , 2002, ASPLOS X.

[14]  Yo-Hwan Koh,et al.  A low power and highly reliable 400Mbps mobile DDR SDRAM with on-chip distributed ECC , 2007, 2007 IEEE Asian Solid-State Circuits Conference.

[15]  P.K. Ko,et al.  Random telegraph noise of deep-submicrometer MOSFETs , 1990, IEEE Electron Device Letters.

[16]  Bianca Schroeder,et al.  Cosmic rays don't strike twice: understanding the nature of DRAM errors and the implications for system design , 2012, ASPLOS XVII.

[17]  Kai Li,et al.  The PARSEC benchmark suite: Characterization and architectural implications , 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).

[18]  Dae-Hyun Kim,et al.  ArchShield: architectural framework for assisting DRAM scaling by tolerating high error rates , 2013, ISCA.

[19]  Rodolfo Pellizzoni,et al.  PALLOC: DRAM bank-aware memory allocator for performance isolation on multicore platforms , 2014, 2014 IEEE 19th Real-Time and Embedded Technology and Applications Symposium (RTAS).

[20]  Behrooz Parhami,et al.  Defect, Fault, Error,..., or Failure? , 1997 .

[21]  Norman P. Jouppi,et al.  LOT-ECC: Localized and tiered reliability mechanisms for commodity memory systems , 2012, 2012 39th Annual International Symposium on Computer Architecture (ISCA).

[22]  Hyoung-Joo Kim,et al.  25.1 A 3.2Gb/s/pin 8Gb 1.0V LPDDR4 SDRAM with integrated ECC engine for sub-1V DRAM core operation , 2014, 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC).

[23]  Richard Veras,et al.  RAIDR: Retention-aware intelligent DRAM refresh , 2012, 2012 39th Annual International Symposium on Computer Architecture (ISCA).

[24]  Tao Zhang,et al.  CREAM: A Concurrent-Refresh-Aware DRAM Memory architecture , 2014, 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA).

[25]  Charles H. Stapper,et al.  Synergistic Fault-Tolerance for Memory Chips , 1992, IEEE Trans. Computers.

[26]  Lizy Kurian John,et al.  The virtual write queue: coordinating DRAM and last-level cache policies , 2010, ISCA.

[27]  J. W. Park,et al.  DRAM variable retention time , 1992, 1992 International Technical Digest on Electron Devices Meeting.

[28]  Kevin Kai-Wei Chang,et al.  Staged memory scheduling: Achieving high performance and scalability in heterogeneous systems , 2012, 2012 39th Annual International Symposium on Computer Architecture (ISCA).

[29]  Jeffrey T. Draper,et al.  DEC ECC design to improve memory reliability in Sub-100nm technologies , 2008, 2008 15th IEEE International Conference on Electronics, Circuits and Systems.

[30]  Sanguhn Cha,et al.  Single-Error-Correction and Double-Adjacent-Error-Correction Code for Simultaneous Testing of Data Bit and Check Bit Arrays in Memories , 2014, IEEE Transactions on Device and Materials Reliability.

[31]  藤原 英二,et al.  Code design for dependable systems : theory and practical applications , 2006 .

[32]  Thijs Krol Memory error detection and error correction , 1979 .

[33]  Masashi Horiguchi,et al.  Nanoscale Memory Repair , 2011, Integrated Circuits and Systems.

[34]  Jichuan Chang,et al.  BOOM: Enabling mobile memory based low-power server DIMMs , 2012, 2012 39th Annual International Symposium on Computer Architecture (ISCA).

[35]  Moinuddin K. Qureshi,et al.  XED: Exposing On-Die Error Detection Information for Strong Memory Reliability , 2016, 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA).

[36]  Hongzhong Zheng,et al.  Co-Architecting Controllers and DRAM to Enhance DRAM Process Scaling , 2014 .

[37]  Robert H. Dennard,et al.  Challenges and future directions for the scaling of dynamic random-access memory (DRAM) , 2002, IBM J. Res. Dev..

[38]  Lizy Kurian John,et al.  Minimalist open-page: A DRAM page-mode scheduling policy for the many-core era , 2011, 2011 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[39]  D. C. Bossen b-adjacent error correction , 1970 .

[40]  A. Suzuki,et al.  A 65nm low-power embedded DRAM with extended data-retention sleep mode , 2006, 2006 IEEE International Solid State Circuits Conference - Digest of Technical Papers.

[41]  Ki Tae Park,et al.  Automatic failure analysis system for high density DRAM , 1994, Proceedings., International Test Conference.

[42]  William J. Dally,et al.  Memory access scheduling , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).

[43]  Ken Mai,et al.  The future of wires , 2001, Proc. IEEE.

[44]  O Seongil,et al.  McSimA+: A manycore simulator with application-level+ simulation and detailed microarchitecture modeling , 2013, 2013 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS).

[45]  Carl E. Landwehr,et al.  Basic concepts and taxonomy of dependable and secure computing , 2004, IEEE Transactions on Dependable and Secure Computing.

[46]  Timothy J. Dell,et al.  A white paper on the benefits of chipkill-correct ecc for pc server main memory , 1997 .

[47]  Pradeep Dubey,et al.  Architecting to achieve a billion requests per second throughput on a single key-value store server platform , 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).

[48]  Sukhan Lee,et al.  CiDRA: A cache-inspired DRAM resilience architecture , 2015, 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA).

[49]  Young-Hyun Jun,et al.  A new column redundancy scheme for yield improvement of high speed DRAMs with multiple bit pre-fetch structure , 2001, 2001 Symposium on VLSI Circuits. Digest of Technical Papers (IEEE Cat. No.01CH37185).