Reliability Analysis of SSDs Under Power Fault

Modern storage technology (solid-state disks (SSDs), NoSQL databases, commoditized RAID hardware, etc.) brings new reliability challenges to the already-complicated storage stack. Among other things, the behavior of these new components during power faults—which happen relatively frequently in data centers—is an important yet mostly ignored issue in this dependability-critical area. Understanding how new storage components behave under power fault is the first step towards designing new robust storage systems. In this article, we propose a new methodology to expose reliability issues in block devices under power faults. Our framework includes specially designed hardware to inject power faults directly to devices, workloads to stress storage components, and techniques to detect various types of failures. Applying our testing framework, we test 17 commodity SSDs from six different vendors using more than three thousand fault injection cycles in total. Our experimental results reveal that 14 of the 17 tested SSD devices exhibit surprising failure behaviors under power faults, including bit corruption, shorn writes, unserializable writes, metadata corruption, and total device failure.

[1]  Garth A. Gibson Redundant disk arrays: Reliable, parallel secondary storage. Ph.D. Thesis , 1990 .

[2]  Mendel Rosenblum,et al.  The design and implementation of a log-structured file system , 1991, SOSP '91.

[3]  G. Atwood,et al.  Erratic Erase In ETOX/sup TM/ Flash Memory Array , 1993, Symposium 1993 on VLSI Technology.

[4]  A. Brand,et al.  Novel read disturb failure mechanism induced by FLASH cycling , 1993, 31st Annual Proceedings Reliability Physics 1993.

[5]  Young-Ho Lim,et al.  A 3.3 V 32 Mb NAND flash memory with incremental step pulse programming scheme , 1995 .

[6]  Gregory R. Ganger,et al.  The DiskSim Simulation Environment Version 4.0 Reference Manual (CMU-PDL-08-101) , 1998 .

[7]  Ken Takeuchi,et al.  A multipage cell architecture for high-speed programming multilevel NAND flash memories , 1998, IEEE J. Solid State Circuits.

[8]  R. E. Shiner,et al.  A new reliability model for post-cycling charge retention of flash memories , 2002, 2002 IEEE International Reliability Physics Symposium. Proceedings. 40th Annual (Cat. No.02CH37320).

[9]  A. Tal White Paper Two Flash Technologies Compared : NOR vs NAND , 2002 .

[10]  Roberto Bez,et al.  Introduction to flash memory , 2003, Proc. IEEE.

[11]  Sivan Toledo,et al.  Algorithms and data structures for flash memories , 2005, CSUR.

[12]  Andrea C. Arpaci-Dusseau,et al.  IRON file systems , 2005, SOSP '05.

[13]  Junfeng Yang,et al.  EXPLODE: a lightweight, general system for finding serious storage system errors , 2006, OSDI '06.

[14]  K. Otsuga,et al.  The Impact of Random Telegraph Signals on the Scaling of Multilevel Flash Memories , 2006, 2006 Symposium on VLSI Circuits, 2006. Digest of Technical Papers..

[15]  Bianca Schroeder,et al.  Disk Failures in the Real World: What Does an MTTF of 1, 000, 000 Hours Mean to You? , 2007, FAST.

[16]  Shankar Pasupathy,et al.  An analysis of latent sector errors in disk drives , 2007, SIGMETRICS '07.

[17]  Ozalp Babaoglu,et al.  ACM Transactions on Computer Systems , 2007 .

[18]  Michael Isard,et al.  A design for high-performance flash disks , 2007, OPSR.

[19]  Andrea C. Arpaci-Dusseau,et al.  EIO: Error Handling is Occasionally Correct , 2008, FAST.

[20]  Andrea C. Arpaci-Dusseau,et al.  An analysis of data corruption in the storage stack , 2008, TOS.

[21]  Andrea C. Arpaci-Dusseau,et al.  Parity Lost and Parity Regained , 2008, FAST.

[22]  Anand Kulkarni,et al.  nand Flash Memory and Its Role in Storage Architectures , 2008, Proceedings of the IEEE.

[23]  Young-Jin Kim,et al.  LAST: locality-aware sector translation for NAND flash memory-based storage systems , 2008, OPSR.

[24]  Rina Panigrahy,et al.  Design Tradeoffs for SSD Performance , 2008, USENIX ATC.

[25]  Paul H. Siegel,et al.  Characterizing flash memory: Anomalies, observations, and applications , 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[26]  Xiaodong Zhang,et al.  Understanding intrinsic characteristics and system implications of flash memory based solid state drives , 2009, SIGMETRICS '09.

[27]  James S. Plank,et al.  Mean Time to Meaningless: MTTDL, Markov Models, and Storage System Reliability , 2010, HotStorage.

[28]  Sebastian Burckhardt,et al.  Effective Data-Race Detection for the Kernel , 2010, OSDI.

[29]  Andrea C. Arpaci-Dusseau,et al.  Coerced Cache Eviction and discreet mode journaling: Dealing with misbehaving disks , 2011, 2011 IEEE/IFIP 41st International Conference on Dependable Systems & Networks (DSN).

[30]  Steven Swanson,et al.  Understanding the impact of power loss on flash memory , 2011, 2011 48th ACM/EDAC/IEEE Design Automation Conference (DAC).

[31]  Xiaozhou Li,et al.  Analyzing consistency properties for fun and profit , 2011, PODC '11.

[32]  Wei Wu,et al.  Optimizing NAND flash-based SSDs via retention relaxation , 2012, FAST.

[33]  Angela Demke Brown,et al.  Recon: Verifying file system consistency at runtime , 2012, TOS.

[34]  Andrea C. Arpaci-Dusseau,et al.  De-indirection for flash-based SSDs with nameless writes , 2012, FAST.

[35]  Onur Mutlu,et al.  Error patterns in MLC NAND flash memory: Measurement, characterization, and analysis , 2012, 2012 Design, Automation & Test in Europe Conference & Exhibition (DATE).

[36]  Andrea C. Arpaci-Dusseau,et al.  Consistency without ordering , 2012, FAST.

[37]  Steven Swanson,et al.  The bleak future of NAND flash memory , 2012, FAST.

[38]  Youyou Lu,et al.  Extending the lifetime of flash-based storage through reducing write amplification from file systems , 2013, FAST.

[39]  Philippe Bonnet,et al.  Linux block IO: introducing multi-queue SSD access on multi-core systems , 2013, SYSTOR '13.

[40]  Mark Lillibridge,et al.  Understanding the robustness of SSDS under power fault , 2013, FAST.

[41]  Andrea C. Arpaci-Dusseau,et al.  A Study of Linux File System Evolution , 2013, FAST.

[42]  Mark Lillibridge,et al.  Torturing Databases for Fun and Profit , 2014, OSDI.

[43]  Paul H. Siegel,et al.  Constrained Codes that Mitigate Inter-Cell Interference in Read/Write Cycles for Flash Memories , 2014, IEEE Journal on Selected Areas in Communications.

[44]  Andrea C. Arpaci-Dusseau,et al.  All File Systems Are Not Created Equal: On the Complexity of Crafting Crash-Consistent Applications , 2014, OSDI.

[45]  Rodrigo Rodrigues,et al.  SKI: Exposing Kernel Concurrency Bugs through Systematic Schedule Exploration , 2014, OSDI.

[46]  Bharath Ramsundar,et al.  NVMKV: A Scalable and Lightweight Flash Aware Key-Value Store , 2014, HotStorage.

[47]  Wei Wang,et al.  ReconFS: a reconstructable file system on flash storage , 2014, FAST.

[48]  Xavier Jimenez,et al.  Wear unleveling: improving NAND flash lifetime by balancing page endurance , 2014, FAST.

[49]  Osman S. Unsal,et al.  Neighbor-cell assisted error correction for MLC NAND flash memories , 2014, SIGMETRICS '14.

[50]  Ming Zhao,et al.  How Much Can Data Compressibility Help to Improve NAND Flash Memory Lifetime? , 2015, FAST.

[51]  Paul H. Siegel,et al.  Error analysis and inter-cell interference mitigation in multi-level cell flash memories , 2015, 2015 IEEE International Conference on Communications (ICC).

[52]  Joo Young Hwang,et al.  F2FS: A New File System for Flash Storage , 2015, FAST.

[53]  Changwoo Min,et al.  Cross-checking semantic correctness: the case of finding file system bugs , 2015, SOSP.

[54]  Eitan Yaakobi,et al.  Write Once, Get 50% Free: Saving SSD Erase Costs Using WOM Codes , 2015, FAST.

[55]  Andrea C. Arpaci-Dusseau,et al.  WiscKey: Separating Keys from Values in SSD-conscious Storage , 2016, FAST.

[56]  Arif Merchant,et al.  Flash Reliability in Production: The Expected and the Unexpected , 2016, FAST.

[57]  Adam Chlipala,et al.  Using Crash Hoare logic for certifying the FSCQ file system , 2015, USENIX Annual Technical Conference.