Zettabyte reliability with flexible end-to-end data integrity

We introduce flexible end-to-end data integrity for storage systems, which enables each component along the I/O path (e.g., memory, disk) to alter its protection scheme to meet the performance and reliability demands of the system. We apply this new concept to the Zettabyte File System (ZFS) and build Zettabyte-Reliable ZFS (Z2FS). Z2FS provides dynamic tradeoffs between performance and protection and offers Zettabyte Reliability, defined as one undetected corruption per zettabyte of data read. We develop an analytical framework to evaluate reliability, and the protection approaches in Z2FS are built on that framework. For comparison, we implement a straightforward End-to-End ZFS (E2ZFS) that uses the same protection scheme for all components. Through analysis and experiment, we show that Z2FS achieves better overall performance than E2ZFS while still offering Zettabyte Reliability.
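To make the reliability target concrete, the following minimal sketch converts "one undetected corruption per zettabyte read" into a per-block probability budget. It is illustrative arithmetic only, not the analytical framework from the paper. The binary zettabyte (2^70 bytes), the 4 KiB block size, and the function name are assumptions introduced here.

```python
from fractions import Fraction

# Assumption: a zettabyte is taken as 2**70 bytes (binary convention).
ZETTABYTE_BYTES = 2**70
# Assumption: a hypothetical 4 KiB block as the unit of each read.
BLOCK_BYTES = 4 * 1024


def undetected_rate_per_block(block_bytes=BLOCK_BYTES,
                              budget_bytes=ZETTABYTE_BYTES):
    """Per-block undetected-corruption probability for which the expected
    number of undetected corruptions is one per budget_bytes read."""
    return Fraction(block_bytes, budget_bytes)


rate = undetected_rate_per_block()
blocks_per_zettabyte = ZETTABYTE_BYTES // BLOCK_BYTES

# Expected undetected corruptions after reading one zettabyte.
expected_corruptions = rate * blocks_per_zettabyte

print(f"per-block budget: {float(rate):.3e}")
print(f"expected undetected corruptions per zettabyte: {expected_corruptions}")
```

Under these assumptions the per-block budget is 2^-58, roughly 3.47e-18. A different block size or a decimal zettabyte (10^21 bytes) changes the number but not the method.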
