
Reliability Markov models are becoming unreliable (WIP submission)

Markov models have traditionally been used to understand the reliability of storage systems. They provide intuition about the sensitivity of storage system reliability to changes in disk failure rates, rebuild rates, sector failure rates, scrubbing rates, and storage capacity. Unfortunately, as we move towards multi-disk fault tolerant storage systems, i.e., storage systems that tolerate two or more disk failures, such as RAID 6, reliability estimates based on traditional Markov models become unreliable. Our concerns go beyond the recent demonstration that Weibull distributions need to be used instead of exponential distributions to accurately determine storage system reliability [1]. We believe that the traditional construction of Markov models is flawed for multi-disk fault tolerant systems, and that their accuracy and utility decrease as the redundancy in the system increases.

In this WIP, we will only discuss one of our concerns: modeling disk rebuild correctly. Two traditional Markov models are used to model two distinct storage rebuild policies. In a serial rebuild policy, a storage system rebuilds the first failed disk in its entirety before rebuilding the next failed disk, and so on. In a concurrent rebuild policy, a storage system begins rebuilding failed disks as they fail. Figure 1 illustrates the two traditional Markov models for an n-disk system that tolerates m disk failures. The label of each state indicates the number of failed disks; state m + 1 is the data loss state. The transitions from left to right are disk failures, with λ being the failure rate. The transitions from right to left are disk rebuilds, with μ being the rebuild rate.

For single disk fault tolerant systems, the serial and concurrent rebuild models are identical, and are correct. For multi-disk fault tolerant systems, both rebuild models are incorrect. The same modeling error is made in each case.
The rebuild transitions for states 2 through m are incorrect: they model the rebuild of the disk that failed most recently, whereas reliability is dominated by the rebuild of the disk that failed earliest. In essence, traditional Markov models reset the rebuild time for all disks being rebuilt whenever another disk fails. The traditional serial rebuild Markov model thus models a rebuild policy ...
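To make the traditional models described above concrete, the following minimal sketch computes the mean time to data loss (MTTDL) of the birth-death chain with states 0 through m + 1. Figure 1 is not reproduced here, so the transition rates are assumptions based on the common textbook construction: a failure rate of (n - k)λ out of state k, a rebuild rate of μ in the serial model, and a rebuild rate of kμ in the concurrent model. The function name `mttdl` and the parameter values in the usage lines are hypothetical. The sketch reproduces the traditional models only, including the modeling error under discussion; it does not attempt a corrected model.

```python
def mttdl(n, m, lam, mu, policy="serial"):
    """Expected time to reach the data loss state m + 1 from state 0.

    Assumed rates (not taken from Figure 1):
      failure out of state k : (n - k) * lam
      rebuild out of state k : mu (serial) or k * mu (concurrent), k >= 1
    """
    if policy not in ("serial", "concurrent"):
        raise ValueError("policy must be 'serial' or 'concurrent'")
    if not 0 < m < n:
        raise ValueError("need 0 < m < n")

    total = 0.0
    hop = 0.0  # expected time to first move from state k to state k + 1
    for k in range(m + 1):
        fail = (n - k) * lam
        if k == 0:
            rebuild = 0.0  # nothing to rebuild when no disk has failed
        elif policy == "serial":
            rebuild = mu
        else:
            rebuild = k * mu
        # Birth-death recurrence: h_k = (1 + r_k * h_{k-1}) / f_k
        hop = (1.0 + rebuild * hop) / fail
        total += hop
    return total


# Hypothetical parameters: 8 disks, rates expressed per hour.
n, lam, mu = 8, 1e-5, 1.0 / 24.0
for m in (1, 2):
    serial = mttdl(n, m, lam, mu, "serial")
    concurrent = mttdl(n, m, lam, mu, "concurrent")
    print(f"m={m}: serial={serial:.3e} h, concurrent={concurrent:.3e} h")
```

Under these assumptions the two policies give the same value for m = 1, consistent with the statement that the serial and concurrent models are identical for single disk fault tolerant systems, and they diverge once m is 2 or more.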

K. Greenan

[1] Michael G. Pecht, et al. Enhanced Reliability Modeling of RAID Storage Systems, 2007, 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN'07).