Live together or Die Alone: Block cooperation to extend lifetime of resistive memories

Block-level cooperation is an endurance management technique that operates on top of error correction mechanisms to extend memory lifetimes. Once an error recovery scheme fails to recover from faults in a data block, the entire physical page associated with that block is disabled and becomes unavailable to the physical address space. To reduce the page waste caused by early block failures, other blocks can be used to support the failed block, working cooperatively to keep it alive and extend the faulty page's lifetime. We combine the proposed technique with existing error recovery schemes, such as Error Correction Pointers (ECP) and Aegis, to increase memory lifetimes. Block cooperation is realized through metadata sharing in ECP, where one data block shares its unused metadata with another data block. When combined with Aegis, block cooperation is realized through reorganizing data layout, where blocks possessing few faults come to the aid of failed blocks, bringing them back from the dead. Employing block cooperation at a single level (or multiple levels) on top of ECP and Aegis, we can increase memory lifetimes by 28% (37%), and 8% (14%) on average, respectively.

[1]  Bruce Jacob,et al.  Memory Systems: Cache, DRAM, Disk , 2007 .

[2]  Norman P. Jouppi,et al.  FREE-p: Protecting non-volatile memory against both hard and soft errors , 2011, 2011 IEEE 17th International Symposium on High Performance Computer Architecture.

[3]  Hsien-Hsin S. Lee,et al.  SAFER: Stuck-At-Fault Error Recovery for Memories , 2010, 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture.

[4]  Jiwu Shu,et al.  Aegis: Partitioning data block for efficient recovery of stuck-at-faults in phase change memory , 2013, 2013 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[5]  Tao Li,et al.  Characterizing and mitigating the impact of process variations on phase change based memory systems , 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[6]  Rami G. Melhem,et al.  RDIS: A recursively defined invertible set scheme to tolerate multiple stuck-at faults in resistive memory , 2012, IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2012).

[7]  Haralampos Pozidis,et al.  Multilevel-Cell Phase-Change Memory: A Viable Technology , 2016, IEEE Journal on Emerging and Selected Topics in Circuits and Systems.

[8]  Karin Strauss,et al.  Use ECP, not ECC, for hard failures in resistive memories , 2010, ISCA.

[9]  Kinam Kim,et al.  Technology for sub-50nm DRAM and NAND flash manufacturing , 2005, IEEE InternationalElectron Devices Meeting, 2005. IEDM Technical Digest..

[10]  Vijayalakshmi Srinivasan,et al.  Enhancing lifetime and security of PCM-based Main Memory with Start-Gap Wear Leveling , 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[11]  Jongman Kim,et al.  A dual-phase compression mechanism for hybrid DRAM/PCM main memory architectures , 2012, GLSVLSI '12.

[12]  Onur Mutlu,et al.  Architecting phase change memory as a scalable dram alternative , 2009, ISCA '09.

[13]  Karin Strauss,et al.  Zombie memory: Extending memory lifetime by reviving dead blocks , 2013, ISCA.

[14]  M. Welbourne More on Moore , 1992 .

[15]  Qi Wang,et al.  A 20nm 1.8V 8Gb PRAM with 40MB/s program bandwidth , 2012, 2012 IEEE International Solid-State Circuits Conference.

[16]  Moinuddin K. Qureshi Pay-As-You-Go: Low-overhead hard-error correction for phase change memories , 2011, 2011 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).