A new class of array codes for memory storage

In this article we describe a class of error control codes called “diff-MDS” codes that are custom designed for highly resilient computer memory storage. The error scenarios of concern range from single-bit errors to memory chip failures and catastrophic memory module failures. Our approach starts from a parity code that is easy to decode for memory module failures and expurgates it so that a few additional small errors can be handled as well. This preserves most of the decoding complexity advantages of the original code while extending its error handling beyond its original intent. The expurgation is carefully crafted so that the strength of the resulting code is comparable to that of a Reed-Solomon code when used in this particular setting. An instance of this class of codes has been incorporated in IBM's zEnterprise mainframe offering, setting a new industry standard for memory resiliency.
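As a rough illustration of the starting point only, the sketch below shows a plain XOR parity code across memory modules, which makes recovery from a single known module failure trivial. It is not the diff-MDS construction: the expurgation step is not specified in this abstract and is not modeled here. The names `xor_parity` and `recover_module`, the byte-string representation of a module, and the sample data are all assumptions made for the example.

```python
from functools import reduce

def xor_parity(modules):
    """Bytewise XOR of equal-length byte strings (hypothetical helper)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*modules))

def recover_module(stored, failed_index):
    """Rebuild one failed module from the survivors (hypothetical helper).

    Because all modules, parity included, XOR to zero, the content of a
    known-failed module equals the XOR of the remaining ones.
    """
    survivors = [m for i, m in enumerate(stored) if i != failed_index]
    return xor_parity(survivors)

# Three data modules plus one parity module (illustrative values only).
data = [b"\x12\x34", b"\xab\xcd", b"\x0f\xf0"]
stored = data + [xor_parity(data)]

# Treat module 1 as failed and rebuild it from the other three.
rebuilt = recover_module(stored, 1)
```

Plain parity of this kind handles a module failure whose location is known but has nothing left over for additional errors; per the abstract, restricting the set of codewords is what allows a few extra small errors to be handled as well.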
