A Large-Scale Study of Flash Memory Failures in the Field

Servers use flash memory based solid state drives (SSDs) as a high-performance alternative to hard disk drives to store persistent data. Unfortunately, recent increases in flash density have also brought about decreases in chip-level reliability. In a data center environment, flash-based SSD failures can lead to downtime and, in the worst case, data loss. As a result, it is important to understand flash memory reliability characteristics over flash lifetime in a realistic production data center environment running modern applications and system software. This paper presents the first large-scale study of flash-based SSD reliability in the field. We analyze data collected across a majority of flash-based solid state drives at Facebook data centers over nearly four years and many millions of operational hours in order to understand failure properties and trends of flash-based SSDs. Our study considers a variety of SSD characteristics, including: the amount of data written to and read from flash chips; how data is mapped within the SSD address space; the amount of data copied, erased, and discarded by the flash controller; and flash board temperature and bus power. Based on our field analysis of how flash memory errors manifest when running modern workloads on modern SSDs, this paper is the first to make several major observations: (1) SSD failure rates do not increase monotonically with flash chip wear; instead they go through several distinct periods corresponding to how failures emerge and are subsequently detected, (2) the effects of read disturbance errors are not prevalent in the field, (3) sparse logical data layout across an SSD's physical address space (e.g., non-contiguous data), as measured by the amount of metadata required to track logical address translations stored in an SSD-internal DRAM buffer, can greatly affect SSD failure rate, (4) higher temperatures lead to higher failure rates, but techniques that throttle SSD operation appear to greatly reduce the negative reliability impact of higher temperatures, and (5) data written by the operating system to flash-based SSDs does not always accurately indicate the amount of wear induced on flash cells due to optimizations in the SSD controller and buffering employed in the system software. We hope that the findings of this first large-scale flash memory reliability study can inspire others to develop other publicly-available analyses and novel flash reliability solutions.

[1]  Onur Mutlu,et al.  Read Disturb Errors in MLC NAND Flash Memory: Characterization, Mitigation, and Recovery , 2015, 2015 45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks.

[2]  J. Kessenich,et al.  Bit error rate in NAND Flash memories , 2008, 2008 IEEE International Reliability Physics Symposium.

[3]  Jae-Duk Lee,et al.  A New Programming Disturbance Phenomenon in NAND Flash Memory By Source/Drain Hot-Electrons Generated By GIDL Current , 2006, 2006 21st IEEE Non-Volatile Semiconductor Memory Workshop.

[4]  K. Imamiya,et al.  A negative Vth cell architecture for highly scalable, excellently noise immune and highly reliable NAND flash memories , 1998, 1998 Symposium on VLSI Circuits. Digest of Technical Papers (Cat. No.98CH36215).

[5]  Sang-Won Lee,et al.  A survey of Flash Translation Layer , 2009, J. Syst. Archit..

[6]  P. Olivo,et al.  Call for papers - Wireless Technology: Models, Designs and Applications , 2003 .

[7]  Steven Swanson,et al.  The bleak future of NAND flash memory , 2012, FAST.

[8]  Onur Mutlu,et al.  Data retention in MLC NAND flash memory: Characterization, optimization, and recovery , 2015, 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA).

[9]  P. Cochat,et al.  Et al , 2008, Archives de pediatrie : organe officiel de la Societe francaise de pediatrie.

[10]  H. Belgal,et al.  Recovery Effects in the Distributed Cycling of Flash Memories , 2006, 2006 IEEE International Reliability Physics Symposium Proceedings.

[11]  Young-Ho Lim,et al.  A 3.3 V 32 Mb NAND flash memory with incremental step pulse programming scheme , 1995 .

[12]  G. Atwood,et al.  Erratic Erase In ETOX/sup TM/ Flash Memory Array , 1993, Symposium 1993 on VLSI Technology.

[13]  Keum Hwan Noh,et al.  Abnormal Disturbance Mechanism of Sub-100 nm NAND Flash Memory , 2006 .

[14]  K. Sakui,et al.  A negative V/sub th/ cell architecture for highly scalable, excellently noise-immune, and highly reliable NAND flash memories , 1999 .

[15]  R. E. Shiner,et al.  A new reliability model for post-cycling charge retention of flash memories , 2002, 2002 IEEE International Reliability Physics Symposium. Proceedings. 40th Annual (Cat. No.02CH37320).

[16]  Osman S. Unsal,et al.  Neighbor-cell assisted error correction for MLC NAND flash memories , 2014, SIGMETRICS '14.

[17]  Zheng Shao,et al.  Hive - a petabyte scale data warehouse using Hadoop , 2010, 2010 IEEE 26th International Conference on Data Engineering (ICDE 2010).

[18]  Y. Mori,et al.  Analysis of detrap current due to oxide traps to improve flash memory retention , 2000, 2000 IEEE International Reliability Physics Symposium Proceedings. 38th Annual (Cat. No.00CH37059).

[19]  P. Kalavade,et al.  Flash EEPROM threshold instabilities due to charge trapping during program/erase cycling , 2004, IEEE Transactions on Device and Materials Reliability.

[20]  M. Ushiyama,et al.  Read-disturb degradation mechanism due to electron trapping in the tunnel oxide for low-voltage flash memories , 1994, Proceedings of 1994 IEEE International Electron Devices Meeting.

[21]  Onur Mutlu,et al.  Program interference in MLC NAND flash memory: Characterization, modeling, and mitigation , 2013, ICCD.

[22]  Kinam Kim,et al.  Degradation of tunnel oxide by FN current stress and its effects on data retention characteristics of 90 nm NAND flash memory cells , 2003, 2003 IEEE International Reliability Physics Symposium Proceedings, 2003. 41st Annual..

[23]  C. Pipper,et al.  [''R"--project for statistical computing]. , 2008, Ugeskrift for laeger.

[24]  Bianca Schroeder,et al.  Disk Failures in the Real World: What Does an MTTF of 1, 000, 000 Hours Mean to You? , 2007, FAST.

[25]  A. Lacaita,et al.  First evidence for injection statistics accuracy limitations in NAND Flash constant-current Fowler-Nordheim programming , 2007, 2007 IEEE International Electron Devices Meeting.

[26]  Yong Wang,et al.  SDF: software-defined flash for web-scale internet storage systems , 2014, ASPLOS.

[27]  J. Yugami,et al.  A novel analysis method of threshold voltage shift due to detrap in a multi-level flash memory , 2001, 2001 Symposium on VLSI Technology. Digest of Technical Papers (IEEE Cat. No.01 CH37184).

[28]  R. Degraeve,et al.  Analytical percolation model for predicting anomalous charge loss in flash memories , 2004, IEEE Transactions on Electron Devices.

[29]  K. Otsuga,et al.  The Impact of Random Telegraph Signals on the Scaling of Multilevel Flash Memories , 2006, 2006 Symposium on VLSI Circuits, 2006. Digest of Technical Papers..

[30]  Onur Mutlu,et al.  Error patterns in MLC NAND flash memory: Measurement, characterization, and analysis , 2012, 2012 Design, Automation & Test in Europe Conference & Exhibition (DATE).

[31]  Mingzhen Xu,et al.  Extended Arrhenius law of time-to-breakdown of ultrathin gate oxides , 2003 .

[32]  Onur Mutlu,et al.  Threshold voltage distribution in MLC NAND flash memory: Characterization, analysis, and modeling , 2013, 2013 Design, Automation & Test in Europe Conference & Exhibition (DATE).

[33]  Osman S. Unsal,et al.  Flash correct-and-refresh: Retention-aware error management for increased flash memory lifetime , 2012, 2012 IEEE 30th International Conference on Computer Design (ICCD).

[34]  Tae-Sung Jung,et al.  A 3.3 V 128 Mb multi-level NAND flash memory for mass storage applications , 1996, 1996 IEEE International Solid-State Circuits Conference. Digest of TEchnical Papers, ISSCC.

[35]  A. Brand,et al.  Novel read disturb failure mechanism induced by FLASH cycling , 1993, 31st Annual Proceedings Reliability Physics 1993.

[36]  Onur Mutlu,et al.  ERRoR ANAlysIs AND RETENTIoN-AwARE ERRoR MANAgEMENT FoR NAND FlAsh MEMoRy , 2013 .