Modeling Impact of Human Errors on the Data Unavailability and Data Loss of Storage Systems

Data storage systems (DSSs) and their availability play a crucial role in contemporary datacenters. Despite mechanisms such as automatic failover, human agents remain involved in datacenter operation, and their destructive errors are therefore inevitable. Because exascale datacenters use a very large number of disk drives with high failure rates, the disk subsystem has become a major source of data unavailability (DU) and data loss (DL) initiated by human errors. In this paper, we investigate the effect of incorrect disk replacement service (IDRS) on the availability and reliability of DSSs. To this end, we analyze the consequences of IDRS in a disk array and conduct Monte Carlo simulations to evaluate DU and DL during mission time. The proposed modeling framework can cope with different storage array configurations and with data object survivability, which represents the effect of system-level redundancies such as remote backups and mirrors. The model parameters are obtained from industrial and scientific reports, alongside field data extracted from a datacenter operating with 70 storage racks. The results show that ignoring the impact of IDRS leads to underestimating unavailability by up to three orders of magnitude. Moreover, our study suggests that, once the effect of human errors is considered, conventional beliefs about the dependability of different redundant array of independent disks (RAID) mechanisms should be revised: RAID1 can result in lower availability than RAID5 in the presence of human errors. The results also show that an automatic failover policy (using hot spare disks) can reduce the drastic impact of human errors by two orders of magnitude.
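The Monte Carlo approach mentioned above can be illustrated with a minimal sketch. It is not the paper's model: it assumes a single RAID5 array, exponentially distributed disk failures, a fixed rebuild time, and a fixed time to undo a wrong disk replacement, and it ignores failures that occur while the array is unavailable. All function names, parameter names, and numeric values are hypothetical and chosen only for illustration.

```python
import random


def simulate_mission(n_disks, fail_rate, rebuild_hours, p_wrong_replacement,
                     undo_hours, mission_hours, rng):
    """Simulate one mission of a RAID5 array (illustrative assumptions only).

    Returns (downtime_hours, data_lost).
    """
    t = 0.0
    downtime = 0.0
    while True:
        # Time until the first failure among n healthy disks.
        t += rng.expovariate(n_disks * fail_rate)
        if t >= mission_hours:
            return downtime, False
        # Human error: the operator pulls a healthy disk instead of the
        # failed one, so the array is unavailable until the mistake is undone.
        if rng.random() < p_wrong_replacement:
            outage = min(undo_hours, mission_hours - t)
            downtime += outage
            t += outage
            if t >= mission_hours:
                return downtime, False
        # Rebuild window: a second failure among the remaining disks
        # before the rebuild completes is counted as data loss.
        second = rng.expovariate((n_disks - 1) * fail_rate)
        if second < rebuild_hours and t + second < mission_hours:
            return downtime, True
        t += rebuild_hours


def estimate(trials, seed=0, **params):
    """Average unavailability and data-loss probability over many missions."""
    rng = random.Random(seed)
    total_downtime = 0.0
    losses = 0
    for _ in range(trials):
        downtime, lost = simulate_mission(rng=rng, **params)
        total_downtime += downtime
        losses += lost
    unavailability = total_downtime / (trials * params["mission_hours"])
    return unavailability, losses / trials


if __name__ == "__main__":
    # Hypothetical parameter values, not taken from the paper.
    base = dict(n_disks=5, fail_rate=1e-5, rebuild_hours=10.0,
                undo_hours=1.0, mission_hours=87600.0)
    for p in (0.0, 0.01):
        du, dl = estimate(2000, p_wrong_replacement=p, **base)
        print(f"p_wrong={p}: unavailability={du:.3e}, P(data loss)={dl:.3e}")
```

Setting `p_wrong_replacement` to zero removes the only source of downtime in this sketch, so comparing the two runs shows how a model that ignores human error reports no unavailability from this mechanism at all.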
