Characterizing the Dependability of Distributed Storage Systems Using a Two-Layer Hidden Markov Model-Based Approach

Dependability is of paramount importance in modern distributed storage systems. A key challenge in deploying a storage system with specific dependability requirements, or in improving the dependability of an existing system, is how to characterize that dependability comprehensively and efficiently. In this paper, we present a two-layer Hidden Markov Model (HMM) to characterize the dependability of a distributed storage system, focusing on the parallel file system layer. By training the model with observable measurements, such as I/O performance, collected under faulty scenarios, we quantify system dependability as a tuple of state transition probability, service degradation, and fault latency under those scenarios. Our experimental results on a distributed storage system running PVFS (Parallel Virtual File System) demonstrate the effectiveness of our HMM-based approach, which efficiently captures the behavior patterns of the target system under disk faults and memory overuse.
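To make the idea of inferring hidden system states from observable I/O measurements concrete, the following is a minimal sketch of Viterbi decoding on a single-layer discrete HMM. It is not the two-layer model of this paper: the state names, the quantized throughput symbols, and every probability value are hypothetical placeholders chosen for illustration, and in practice the parameters would be learned from measurements (for example with Baum-Welch) rather than written by hand.

```python
import math

# Hypothetical hidden states and observation symbols (assumptions, not from the paper).
STATES = ["healthy", "degraded", "failed"]
SYMBOLS = {"high": 0, "medium": 1, "low": 2}  # quantized I/O throughput levels

# Illustrative parameters only; real values would be estimated from training data.
START = [0.90, 0.08, 0.02]          # initial state distribution
TRANS = [                           # TRANS[i][j] = P(next state j | current state i)
    [0.90, 0.08, 0.02],
    [0.10, 0.70, 0.20],
    [0.05, 0.15, 0.80],
]
EMIT = [                            # EMIT[i][k] = P(symbol k | state i)
    [0.80, 0.15, 0.05],
    [0.20, 0.60, 0.20],
    [0.05, 0.15, 0.80],
]


def viterbi(observations):
    """Return the most likely hidden-state sequence for a list of symbols."""
    obs = [SYMBOLS[o] for o in observations]
    n = len(STATES)
    # Log-probabilities avoid numeric underflow on long sequences.
    score = [math.log(START[s]) + math.log(EMIT[s][obs[0]]) for s in range(n)]
    back = []  # back[t][s] = best predecessor of state s at step t + 1
    for o in obs[1:]:
        prev = score
        pointers = []
        score = []
        for s in range(n):
            best = max(range(n), key=lambda p: prev[p] + math.log(TRANS[p][s]))
            pointers.append(best)
            score.append(prev[best] + math.log(TRANS[best][s]) + math.log(EMIT[s][o]))
        back.append(pointers)
    # Backtrack from the best final state.
    state = max(range(n), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return [STATES[s] for s in path]


if __name__ == "__main__":
    trace = ["high", "high", "medium", "low", "low", "low"]
    print(viterbi(trace))
```

Given a decoded state sequence, quantities in the spirit of the dependability tuple could be read off, for instance the number of steps between fault injection and the first non-healthy state as a rough proxy for fault latency. That mapping is an assumption of this sketch, not a description of the paper's method.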
