Paper information - Characterizing the Dependability of Distributed Storage Systems Using a Two-Layer Hidden Markov Model-Based Approach

Dependability is of paramount importance in modern distributed storage systems. A key challenge, whether deploying a storage system that must meet specific dependability requirements or improving the dependability of an existing one, is how to characterize that dependability comprehensively and efficiently. In this paper, we present a two-layer Hidden Markov Model (HMM) that characterizes the dependability of a distributed storage system, focusing on the parallel file system layer. By training the model on observable measurements, such as I/O performance, collected under faulty scenarios, we quantify system dependability as a tuple of state transition probability, service degradation, and fault latency under those scenarios. Our experimental results on a distributed storage system running PVFS (Parallel Virtual File System) demonstrate the effectiveness of the HMM-based approach, which efficiently captures the behavior patterns of the target system under disk faults and memory overuse.
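To make the idea of inferring hidden system states from observable measurements concrete, the following is a minimal sketch of a single-layer, discrete HMM. It is not the paper's two-layer model: the state names (`healthy`, `degraded`), the observation symbols (`high`, `low`, standing in for discretized I/O throughput), and every probability are hypothetical values chosen only for illustration. The sketch shows the forward algorithm, which scores how likely an observation sequence is under the model, and Viterbi decoding, which recovers the most likely hidden-state path.

```python
# Hypothetical single-layer HMM. All states, symbols, and probabilities below
# are illustrative assumptions, not values taken from the paper.
STATES = ["healthy", "degraded"]
START = {"healthy": 0.9, "degraded": 0.1}
TRANS = {
    "healthy": {"healthy": 0.9, "degraded": 0.1},
    "degraded": {"healthy": 0.2, "degraded": 0.8},
}
EMIT = {
    "healthy": {"high": 0.8, "low": 0.2},
    "degraded": {"high": 0.3, "low": 0.7},
}


def forward_likelihood(observations):
    """Return P(observations) under the model using the forward algorithm."""
    # alpha[s] = probability of the observations so far, ending in state s.
    alpha = {s: START[s] * EMIT[s][observations[0]] for s in STATES}
    for obs in observations[1:]:
        alpha = {
            s: EMIT[s][obs] * sum(alpha[p] * TRANS[p][s] for p in STATES)
            for s in STATES
        }
    return sum(alpha.values())


def viterbi(observations):
    """Return the most likely hidden-state path for the observations."""
    score = {s: START[s] * EMIT[s][observations[0]] for s in STATES}
    back = []  # back[t][s] = best predecessor of state s at step t + 1
    for obs in observations[1:]:
        prev = score
        pointers = {}
        score = {}
        for s in STATES:
            best = max(STATES, key=lambda p: prev[p] * TRANS[p][s])
            pointers[s] = best
            score[s] = prev[best] * TRANS[best][s] * EMIT[s][obs]
        back.append(pointers)
    # Trace back from the best final state to the start.
    path = [max(STATES, key=lambda s: score[s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


if __name__ == "__main__":
    trace = ["high", "high", "low", "low", "low"]
    print(forward_likelihood(trace))
    print(viterbi(trace))
```

In this toy setting, the position at which the decoded path first leaves `healthy` could serve as a rough proxy for when a fault became visible in the measurements. How the paper actually derives fault latency and service degradation from its trained two-layer model is not specified in the abstract.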
