Design Tradeoffs for SSD Reliability

Flash memory-based SSDs are popular across a wide range of data storage markets, yet the underlying storage medium, flash memory, is becoming increasingly unreliable. As a result, modern SSDs employ a number of in-device reliability enhancement techniques, but none of them offers a one-size-fits-all solution given the multidimensional requirements for SSDs: performance, reliability, and lifetime. In this paper, we examine the design tradeoffs of existing reliability enhancement techniques such as data re-read, intra-SSD redundancy, and data scrubbing. We observe that uncoordinated use of these techniques adversely affects SSD performance, and that careful management of the techniques is necessary for graceful performance degradation while maintaining a high reliability standard. To that end, we propose a holistic reliability management scheme that selectively employs redundancy, conditionally re-reads, and judiciously selects data to scrub. We demonstrate the effectiveness of our scheme by evaluating it across a set of I/O workloads and SSD wear states.
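To make the three coordinated decisions concrete, the following is a minimal sketch of what such a policy could look like. It is not the scheme proposed in the paper: the thresholds, field names, and function names are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass

# Hypothetical tuning knobs; the abstract does not specify any values.
PARITY_WEAR_THRESHOLD = 0.7   # fraction of rated erase cycles consumed
MAX_REREADS = 2               # re-read attempts allowed before falling back
SCRUB_ERROR_THRESHOLD = 0.8   # observed errors / ECC correction limit


@dataclass
class BlockState:
    wear: float               # fraction of rated erase cycles consumed (0.0 to 1.0)
    error_ratio: float        # last observed bit errors / ECC correction limit
    has_parity: bool = False  # whether intra-SSD redundancy covers this block


def needs_parity(block: BlockState) -> bool:
    """Selective redundancy: pay the parity cost only for worn blocks."""
    return block.wear >= PARITY_WEAR_THRESHOLD


def read_action(block: BlockState, ecc_failed: bool, rereads_done: int) -> str:
    """Conditional re-read: retry a bounded number of times, then use parity."""
    if not ecc_failed:
        return "done"
    if rereads_done < MAX_REREADS:
        return "reread"
    if block.has_parity:
        return "rebuild_from_parity"
    return "report_error"


def select_scrub_targets(blocks: list) -> list:
    """Judicious scrubbing: pick only blocks close to the ECC correction limit."""
    return [i for i, b in enumerate(blocks)
            if b.error_ratio >= SCRUB_ERROR_THRESHOLD]
```

The point of the sketch is that each technique is gated by device state rather than applied unconditionally, which is the kind of coordination the abstract argues is needed to limit the performance cost.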
