PARBOR: An Efficient System-Level Technique to Detect Data-Dependent Failures in DRAM

System-level detection and mitigation of DRAM failures offer a variety of system enhancements, such as better reliability, scalability, energy efficiency, and performance. Unfortunately, system-level detection is challenging for DRAM failures that depend on the data content of neighboring cells (data-dependent failures). DRAM vendors internally scramble or remap the system-level address space, so testing for data-dependent failures using neighboring system-level addresses does not actually test the cells that are physically adjacent. In this work, we argue that one promising way to uncover data-dependent failures in the system is to determine the location of physically neighboring cells in the system address space. Unfortunately, if done naively, such a test takes 49 days to detect neighboring addresses even in a single memory row, making it infeasible in real systems. We develop PARBOR, an efficient system-level technique that determines the locations of the physically neighboring DRAM cells in the system address space and uses this information to detect data-dependent failures. To our knowledge, this is the first work that solves the challenge of detecting data-dependent failures in DRAM in the presence of DRAM-internal scrambling of system-level addresses. We experimentally demonstrate the effectiveness of PARBOR using 144 real DRAM chips from three major vendors. Our experimental evaluation shows that PARBOR 1) detects neighboring cell locations with only 66-90 tests, a 745,654X reduction compared to the naive test, and 2) uncovers 21.9% more failures than a random-pattern test that is unaware of the neighbor cell locations. We also introduce a new mechanism that uses PARBOR to reduce the refresh rate based on the data content of memory locations, thereby improving system performance and efficiency.
We hope that our fast and efficient system-level detection technique enables other new ideas and mechanisms that improve the reliability, performance, and energy efficiency of DRAM-based memory systems.
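
The sketch below is one way to make the two quantitative points of the abstract concrete; it is not PARBOR itself. The first part uses a toy, hypothetical scrambling function (swapping two address bits) to show why adjacent system addresses need not be physically adjacent. The second part checks the quoted numbers under two assumptions that the abstract does not state: a row of 8192 cells tested pairwise, and 64 ms per test. These values are used only because they happen to reproduce the 49-day and 745,654X figures.

```python
# Toy illustration only. `scramble`, CELLS_PER_ROW, and SECONDS_PER_TEST are
# assumptions made for this example; they are not taken from the abstract.

ADDR_BITS = 3  # tiny 8-cell "row" so the mapping can be checked by hand


def scramble(addr: int) -> int:
    """Hypothetical vendor mapping: swap address bit 0 and bit 2 (self-inverse)."""
    b0 = addr & 1
    b2 = (addr >> 2) & 1
    return (addr & 0b010) | (b0 << 2) | b2


def physical_neighbors(sys_addr: int) -> list[int]:
    """System addresses of the cells physically next to `sys_addr`."""
    phys = scramble(sys_addr)
    size = 1 << ADDR_BITS
    # Because the swap is its own inverse, scramble() also maps physical
    # positions back to system addresses.
    return [scramble(p) for p in (phys - 1, phys + 1) if 0 <= p < size]


# Part 2: arithmetic for the naive all-pairs test under the stated assumptions.
CELLS_PER_ROW = 8192        # assumed row size
SECONDS_PER_TEST = 0.064    # assumed time per test (64 ms)

naive_tests = CELLS_PER_ROW * CELLS_PER_ROW
naive_days = naive_tests * SECONDS_PER_TEST / 86400
reduction_vs_90_tests = naive_tests / 90

if __name__ == "__main__":
    print("neighbors of system address 1:", physical_neighbors(1))
    print("naive tests:", naive_tests)
    print("naive days: %.1f" % naive_days)
    print("reduction vs. 90 tests: %.0f" % reduction_vs_90_tests)
```

In the toy mapping, system address 1 is physically adjacent to system addresses 6 and 5, not to 0 and 2, which is why a test that writes patterns to consecutive system addresses can miss data-dependent failures.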
