Improving Storage System Reliability with Proactive Error Prediction

This paper proposes using machine learning techniques to make storage systems more reliable in the face of sector errors. Sector errors are partial drive failures in which individual sectors on a drive become unavailable; they occur at a high rate in both hard disk drives and solid state drives. The data in the affected sectors can only be recovered through external forms of redundancy (e.g., another drive in the same RAID), and will be lost if the error is encountered while the system operates in degraded mode, e.g., during RAID reconstruction. In this paper, we explore a range of machine learning techniques and show that sector errors can be predicted ahead of time with high accuracy. Prediction is robust even when only a small amount of training data is available, or when the only available training data comes from a different drive model. We also discuss a number of possible use cases for improving storage system reliability through the use of sector error predictors. We evaluate one such use case in detail: we show that the mean time to detecting errors (and hence the window of vulnerability to data loss) can be greatly reduced by adapting the speed of a scrubber based on error predictions.
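To make the scrubbing use case concrete, the following is a minimal sketch of how a predictor's output could drive the scrub rate. It is not the paper's implementation. The function names, the 336-hour baseline pass time, and the 4x speedup are hypothetical values chosen for illustration. The detection-time estimate assumes a simplified model in which an error appears at a uniformly random point within a periodic scrub pass, so it waits half a pass on average before being found.

```python
def scrub_interval_hours(error_predicted: bool,
                         base_interval_hours: float = 336.0,
                         speedup: float = 4.0) -> float:
    """Time for one full scrub pass over a drive, in hours.

    Hypothetical policy: drives flagged by the predictor are scrubbed
    `speedup` times faster than the baseline; all others keep the baseline.
    """
    if base_interval_hours <= 0:
        raise ValueError("base_interval_hours must be positive")
    if speedup < 1:
        raise ValueError("speedup must be at least 1")
    if error_predicted:
        return base_interval_hours / speedup
    return base_interval_hours


def mean_time_to_detection_hours(interval_hours: float) -> float:
    """Expected delay until a scrub pass reaches a new sector error.

    Simplifying assumption: the error appears at a uniformly random time
    within a periodic scrub pass, so the mean wait is half the pass time.
    """
    return interval_hours / 2.0


if __name__ == "__main__":
    for flagged in (False, True):
        interval = scrub_interval_hours(flagged)
        mttd = mean_time_to_detection_hours(interval)
        print(f"flagged={flagged}: pass={interval:.0f} h, mean detection={mttd:.0f} h")
```

Under these assumed numbers, a flagged drive's window of vulnerability shrinks in proportion to the speedup, while unflagged drives pay no extra scrubbing cost. That is the trade-off an adaptive scrubber is meant to exploit.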
