An Introduction to PAKDD CUP 2020 Dataset

With the rapid development of cloud services, disk storage plays an important role in large-scale production cloud systems, and predicting imminent disk failures is critical for maintaining data reliability. We believe researchers should be able to contribute new techniques for accurate and robust disk failure prediction: any sound approach that works in large-scale cloud systems can help IT and big data companies further improve the robustness of their production cloud systems. With this vision in mind, we have published an open labeled dataset that spans a period of 18 months and covers a total of 220,000 hard drives collected from Alibaba Cloud. The dataset is among the largest released in the community in terms of scale and duration. To help readers understand it, we describe our dataset generation process and present a preliminary analysis of its characteristics. The dataset was adopted in the PAKDD2020 Alibaba AI Ops Competition, in which contestants proposed new disk failure prediction algorithms by analyzing and evaluating the dataset.
