Investigating power outage effects on reliability of solid-state drives

Solid-State Drives (SSDs) have recently been adopted in enterprise servers and high-end storage systems to improve the performance of the storage subsystem. Although high-speed SSDs can significantly improve system performance, they pose a significant reliability threat to write operations upon power failure. In this paper, we present a comprehensive analysis of the impact of workload-dependent parameters on the reliability of SSDs under power failure, covering a variety of SSDs from top manufacturers. To this end, we first develop a platform that provides two features required for the study: a) realistic fault injection into the SSD in a computing system and b) a mechanism for detecting data loss on the SSD after a power failure. In the proposed physical fault injection platform, SSDs experience the real discharge phase of the Power Supply Unit (PSU) that occurs during power failures in data centers, which was neglected in previous studies. The impact of workload-dependent parameters such as workload Working Set Size (WSS), request size, request type, access pattern, and sequence of accesses on SSD failures is carefully studied in the presence of realistic power failures. Experimental results over thousands of fault injections show that data loss occurs even after completion of the request (up to 700 ms later), and that the failure rate is influenced by the type, size, access pattern, and sequence of I/O accesses, while other parameters such as workload WSS have no impact on SSD failures.
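
The abstract does not say how data loss is detected, so the following is only a minimal sketch of one common approach, not the authors' mechanism. It assumes each written block carries its own address, a write sequence number, and a CRC32 checksum, so that after power is restored a block can be classified as intact, corrupted, or lost (an acknowledged write that never reached flash). The names `make_block`, `classify_block`, the 4096-byte block size, and the header layout are all hypothetical.

```python
import struct
import zlib

BLOCK_SIZE = 4096                  # assumed block size, in bytes
HEADER = struct.Struct("<QQ")      # (block address, write sequence number)


def make_block(lba: int, seq: int) -> bytes:
    """Build a self-describing block: header, patterned body, CRC32 trailer."""
    header = HEADER.pack(lba, seq)
    body_len = BLOCK_SIZE - HEADER.size - 4
    body = bytes((lba + seq + i) % 256 for i in range(body_len))
    crc = zlib.crc32(header + body)
    return header + body + struct.pack("<I", crc)


def classify_block(data: bytes, lba: int, expected_seq: int) -> str:
    """Classify a block read back after power is restored.

    expected_seq is the sequence number of the last write to this
    address that the drive acknowledged before the power failure.
    """
    if len(data) != BLOCK_SIZE:
        return "corrupt"
    (stored_crc,) = struct.unpack("<I", data[-4:])
    if zlib.crc32(data[:-4]) != stored_crc:
        return "corrupt"           # torn or bit-flipped content
    found_lba, found_seq = HEADER.unpack(data[:HEADER.size])
    if found_lba != lba:
        return "misdirected"       # valid block stored at the wrong address
    if found_seq < expected_seq:
        return "lost"              # acknowledged write did not persist
    return "ok"


if __name__ == "__main__":
    # Simulated device: the acknowledged write with seq=3 never reached
    # flash, so the older seq=2 content is still what reads back.
    on_disk = make_block(lba=7, seq=2)
    print(classify_block(on_disk, lba=7, expected_seq=3))
```

In a real experiment the log of acknowledged sequence numbers would have to be kept on a separate machine or device that is not affected by the injected power fault.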
