FLOWER and FaME: A Low Overhead Bit-Level Fault-map and Fault-Tolerance Approach for Deeply Scaled Memories

Maintaining appropriate yields in deeply scaled technologies requires tolerating increasingly high fault rates. These fault rates far exceed what traditional general-purpose approaches such as ECC can handle, particularly when faults accrue over time. Effective fault tolerance at such high fault rates requires detailed bit-level knowledge of the location of faulty cells. We provide a solution to this problem in the form of a space-efficient, bit-level fault map called FLOWER. FLOWER uses Bloom filters to provide detailed fault characterization at a relatively small overhead. We demonstrate how FLOWER can enable improved fault tolerance at high fault rates by enhancing existing fault tolerance proposals, yielding 10–100x improvements. Using in-memory processing, FLOWER can report bit-level faults with high accuracy while maintaining less than 2% performance overhead at 10E-4 fault rates and less than 2% loss of memory density. Using a tuned novel hashing technique called MinCI, FLOWER achieves considerably fewer false positives for memory than it would with disk-level hashing techniques, at a fraction of the performance overhead. With PETAL bits, a new technique to protect against errors during in-memory operations, FLOWER can remain resilient against random errors while efficiently targeting predictable errors. Furthermore, we propose a new fault tolerance scheme called FaME, which provides ultra-efficient bit-level sparing by using the FLOWER fault map to identify the location of faults. FLOWER+FaME can achieve 14x longer PCM memory lifetime with half the area overhead of SECDED ECC.
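To make the idea of a Bloom-filter fault map concrete, the following is a minimal sketch of how faulty bit addresses could be recorded and queried. It is not the FLOWER design: the class name `BloomFaultMap`, its parameters, and the SHA-256-derived hash functions are assumptions chosen for illustration, and they stand in for the paper's MinCI hashing and in-memory processing, which are not reproduced here.

```python
import hashlib


class BloomFaultMap:
    """Toy Bloom filter over faulty bit addresses (illustrative, not FLOWER itself)."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, bit_address: int):
        # Assumption: generic SHA-256-derived hashes replace the paper's MinCI hashing.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{bit_address}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def mark_faulty(self, bit_address: int) -> None:
        # Set every filter bit selected by the hash functions for this address.
        for pos in self._positions(bit_address):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def maybe_faulty(self, bit_address: int) -> bool:
        # True means "possibly faulty" (false positives can occur);
        # False means the address was never marked.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(bit_address)
        )


fault_map = BloomFaultMap(num_bits=4096, num_hashes=4)
known_faults = [17, 1024, 99_999]
for addr in known_faults:
    fault_map.mark_faulty(addr)
```

A Bloom filter never produces false negatives, so every marked address is always reported as possibly faulty. The cost is occasional false positives, where a healthy cell is reported as faulty; the choice of filter size and hash functions determines how often that happens.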
