Proactive error prediction to improve storage system reliability

This paper proposes the use of machine learning techniques to make storage systems more reliable in the face of sector errors. Sector errors are partial drive failures, where individual sectors on a drive become unavailable, and occur at a high rate in both hard disk drives and solid state drives. The data in the affected sectors can only be recovered through redundancy in the system (e.g. another drive in the same RAID) and is lost if the error is encountered while the system operates in degraded mode, e.g. during RAID reconstruction. In this paper, we explore a range of different machine learning techniques and show that sector errors can be predicted ahead of time with high accuracy. Prediction is robust, even when only little training data or only training data for a different drive model is available. We also discuss a number of possible use cases for improving storage system reliability through the use of sector error predictors. We evaluate one such use case in detail: We show that the mean time to detecting errors (and hence the window of vulnerability to data loss) can be greatly reduced by adapting the speed of a scrubber based on error predictions.

[1]  Ari Juels,et al.  A Clean-Slate Look at Disk Scrubbing , 2010, FAST.

[2]  Angela Demke Brown,et al.  Opportunistic storage maintenance , 2015, SOSP.

[3]  Arif Merchant,et al.  Flash Reliability in Production: The Expected and the Unexpected , 2016, FAST.

[4]  Shankar Pasupathy,et al.  An analysis of latent sector errors in disk drives , 2007, SIGMETRICS '07.

[5]  Greg Hamerly,et al.  Bayesian approaches to failure prediction for disk drives , 2001, ICML.

[6]  Jie Liu,et al.  SSD Failures in Datacenters: What? When? and Why? , 2016, SYSTOR.

[7]  Cheng-Wen Wu,et al.  An Adaptive-Rate Error Correction Scheme for NAND Flash Memory , 2009, 2009 27th IEEE VLSI Test Symposium.

[8]  Bianca Schroeder,et al.  Understanding latent sector errors and how to protect against them , 2010, TOS.

[9]  Gang Wang,et al.  Proactive drive failure prediction for large scale storage systems , 2013, 2013 IEEE 29th Symposium on Mass Storage Systems and Technologies (MSST).

[10]  Moisés Goldszmidt Finding Soon-to-Fail Disks in a Haystack , 2012, HotStorage.

[11]  Qiang Wu,et al.  A Large-Scale Study of Flash Memory Failures in the Field , 2015, SIGMETRICS 2015.

[12]  Eduardo Pinheiro,et al.  Failure Trends in a Large Disk Drive Population , 2007, FAST.

[13]  Joseph F. Murray,et al.  Improved disk-drive failure warnings , 2002, IEEE Trans. Reliab..

[14]  Bianca Schroeder,et al.  Practical scrubbing: Getting to the bad sector at the right time , 2012, IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2012).

[15]  Ahmed Amer,et al.  Improving Disk Array Reliability Through Expedited Scrubbing , 2010, 2010 IEEE Fifth International Conference on Networking, Architecture, and Storage.

[16]  Bianca Schroeder,et al.  Improving Storage System Reliability with Proactive Error Prediction , 2017 .

[17]  Joseph F. Murray,et al.  Machine Learning Methods for Predicting Failures in Hard Drives: A Multiple-Instance Application , 2005, J. Mach. Learn. Res..

[18]  Gang Wang,et al.  Hard Drive Failure Prediction Using Classification and Regression Trees , 2014, 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks.

[19]  Robert Cypher,et al.  Disks for Data Centers , 2016 .

[20]  Fred Douglis,et al.  RAIDShield: Characterizing, Monitoring, and Proactively Protecting Against Disk Failures , 2015, FAST.