Reliability of SSDs in Enterprise Storage Systems

This article presents the first large-scale field study of NAND-based SSDs in enterprise storage systems (in contrast to drives in distributed data center storage systems). The study is based on a comprehensive set of field data covering 1.6 million SSDs from a major storage vendor (NetApp). The drives come from three manufacturers and span 18 models, 12 capacities, and all major flash technologies (SLC, cMLC, eMLC, 3D-TLC). The data allow us to study many factors that prior work did not examine, including the effect of firmware versions, the reliability of TLC NAND, and correlations between drives within a RAID system. This article presents our analysis, along with a number of practical implications derived from it.
