Flash Reliability in Production: The Expected and the Unexpected

As solid state drives based on flash technology become a staple of persistent data storage in data centers, it is important to understand their reliability characteristics. While there is a large body of work based on experiments with individual flash chips in a controlled lab environment under synthetic workloads, there is a dearth of information on their behavior in the field. This paper provides a large-scale field study covering many millions of drive days, ten different drive models, and different flash technologies (MLC, eMLC, SLC) over 6 years of production use in Google's data centers. We study a wide range of reliability characteristics and come to a number of unexpected conclusions. For example, raw bit error rates (RBER) grow with wearout at a much slower rate than the commonly assumed exponential rate and, more importantly, they are not predictive of uncorrectable errors or other error modes. The widely used metric UBER (uncorrectable bit error rate) is not a meaningful metric, since we see no correlation between the number of reads and the number of uncorrectable errors. We see no evidence that higher-end SLC drives are more reliable than MLC drives within typical drive lifetimes. Compared with traditional hard disk drives, flash drives have a significantly lower replacement rate in the field; however, they have a higher rate of uncorrectable errors.
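To make the argument about UBER concrete, the following minimal sketch computes a bit error rate as errors divided by bits read, which is the usual definition of both RBER and UBER. The function name `bit_error_rate` and the two drives with their counts are hypothetical values chosen for illustration; they are not data from the study.

```python
def bit_error_rate(bit_errors: int, bits_read: int) -> float:
    """Return errors per bit read (the usual form of RBER and UBER)."""
    if bits_read <= 0:
        raise ValueError("bits_read must be positive")
    return bit_errors / bits_read


# Hypothetical drives (illustrative numbers, not from the paper):
# both saw the same number of uncorrectable errors, but one served
# 1000x more reads than the other.
light_uber = bit_error_rate(bit_errors=2, bits_read=10**12)
heavy_uber = bit_error_rate(bit_errors=2, bits_read=10**15)

# The ratio of the two UBER values reflects only the read volume.
print(f"lightly read drive UBER: {light_uber:.1e}")
print(f"heavily read drive UBER: {heavy_uber:.1e}")
print(f"ratio: {light_uber / heavy_uber:.0f}x")
```

If uncorrectable errors do not scale with the number of reads, as the abstract reports, then dividing by bits read mostly measures how busy a drive was: the lightly read drive in this sketch looks far worse even though both drives experienced the same number of uncorrectable errors.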
