Evaluating Reliability of SSD-Based I/O Caches in Enterprise Storage Systems

I/O caching techniques are widely employed in enterprise storage systems to enhance the performance of I/O-intensive applications in large-scale data centers. Owing to their higher performance compared to Hard Disk Drives (HDDs) and their lower price and non-volatility compared to Dynamic Random-Access Memory (DRAM), Flash-based Solid-State Drives (SSDs) are used as the main medium in the caching layer of storage systems. Although SSDs are known as non-volatile devices, recent studies have reported a large number of data failures in SSDs due to power outages. To overcome the reliability implications of SSD-based I/O caching schemes, a RAID-1 (mirrored) configuration is commonly used to avoid data loss due to uncommitted write operations. Such a configuration, however, may still experience data loss in the cache layer due to correlated failures in SSDs. To our knowledge, no previous study has investigated the reliability of SSD-based I/O caching schemes in enterprise storage systems. In this paper, we present a comprehensive analysis of the reliability of SSD-based I/O caching architectures used in enterprise storage systems under power failure and high operating temperature. We explore a variety of SSDs from top vendors and investigate the cache reliability in a mirrored configuration. To this end, we first develop a physical fault injection and failure detection platform and then investigate the impact of workload-dependent parameters on the reliability of the I/O cache in the presence of two failure types common in data centers: power outage and high-temperature faults. We implement an I/O cache scheme using an open-source I/O cache module in the Linux operating system. The experimental results, obtained by conducting more than twenty thousand physical fault injections on the implemented I/O cache with different write policies, reveal that the failure rate of the I/O cache is significantly affected by workload-dependent parameters.
Our results show that, unlike the access pattern of workload requests, other workload-dependent parameters such as request size, Working Set Size (WSS), and the sequence of accesses have a considerable impact on the I/O cache failure rate. We observe a significant growth in the failure rate (by more than 14X) as the request size decreases. Furthermore, we observe that, in addition to writes, read accesses to the I/O cache are subject to failure in the presence of a sudden power outage (the failure mainly occurs while promoting data to the cache). In addition, we observe that the I/O cache experiences no data failure upon high-temperature faults.
