FAULTSIM: A fast, configurable memory-resilience simulator

Recent studies of DRAM failures in data centers and supercomputing environments have highlighted non-uniform failure modes in DRAM chips. Failures fall into different classes depending on their source (e.g., an I/O pin, rank, bank, row, column, or bit). These failures will be common in future memory technologies. To mitigate them, memory systems employ complex error-correcting codes and fault-repair mechanisms. One way to evaluate the relative effectiveness of these mechanisms is with analytical models, but these are time-consuming to derive. We therefore propose FaultSim, a configurable memory-reliability simulation tool for 2D and 3D-stacked memories. FaultSim uses Monte Carlo methods, real-world failure statistics, and novel algorithms to accelerate the evaluation of different resilience schemes. Using multi-granularity failure rates from field studies with BCH-1 (SECDED) and ChipKill codes, FaultSim's simulated results are within 0.41% and 1.13% of an approximate analytical model, respectively.
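
To illustrate the general idea of checking a Monte Carlo reliability estimate against an analytical model, the following is a minimal sketch and not FaultSim's actual algorithm. It assumes independent chips with exponentially distributed time to first fault, and a deliberately simplified, hypothetical failure rule: the memory fails when two or more chips have faulted (roughly, a code that can correct any one faulty chip, ignoring fault granularity and address overlap). All function names, the FIT rate, the chip count, and the mission time are illustrative assumptions.

```python
import math
import random

FIT_TO_PER_HOUR = 1e-9  # 1 FIT = 1 failure per 10^9 device-hours


def chip_has_fault(fit_rate, hours, rng):
    """Return True if a chip suffers at least one fault within `hours`.

    Assumes an exponential time to first fault (constant failure rate).
    """
    time_to_first_fault = rng.expovariate(fit_rate * FIT_TO_PER_HOUR)
    return time_to_first_fault <= hours


def simulate_failure_probability(num_chips, fit_rate, hours, trials, seed=0):
    """Monte Carlo estimate of P(two or more chips faulty) over `hours`.

    The 'two or more faulty chips' rule is a hypothetical simplification
    standing in for a single-chip-correcting code.
    """
    rng = random.Random(seed)
    failures = 0
    for _ in range(trials):
        faulty = sum(
            chip_has_fault(fit_rate, hours, rng) for _ in range(num_chips)
        )
        if faulty >= 2:
            failures += 1
    return failures / trials


def analytical_failure_probability(num_chips, fit_rate, hours):
    """Closed-form P(two or more chips faulty) under the same assumptions."""
    p = 1.0 - math.exp(-fit_rate * FIT_TO_PER_HOUR * hours)
    p_none = (1.0 - p) ** num_chips
    p_one = num_chips * p * (1.0 - p) ** (num_chips - 1)
    return 1.0 - p_none - p_one


if __name__ == "__main__":
    # Illustrative parameters only: 18 chips, 500 FIT per chip, 7 years.
    chips, fit, mission_hours = 18, 500.0, 7 * 8760
    simulated = simulate_failure_probability(chips, fit, mission_hours, 20000)
    analytical = analytical_failure_probability(chips, fit, mission_hours)
    print(f"simulated={simulated:.4f} analytical={analytical:.4f}")
```

With enough trials the simulated estimate should approach the closed-form value. A real study would replace the single per-chip rate with per-granularity rates (bit, row, column, bank, rank) and model the specific correction capability of each code.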
