Measurement-based dependability analysis and modeling for multicomputer systems

This research addresses issues in measurement-based dependability analysis, modeling, and evaluation for multicomputer systems. Measurements are made on two DEC VAXclusters for substantial periods to collect error data. Dependability analysis, modeling, and evaluation are performed based on these data. Results include new findings from measurements, new models validated via real data, and new techniques implemented in a tool developed in the research process. A methodology for measurement-based dependability analysis and modeling is developed. The methodology has been implemented in a software package, MEASURE+. Given data measured from a real system and supplied in a specified format, MEASURE+ can generate commonly used dependability models and measures incorporating correlations. These models and measures are accurate reflections of real system behavior. They are valuable for understanding actual error/failure characteristics, identifying system bottlenecks, evaluating dependability for real systems, and verifying assumptions made in analytical models. An important result is that correlated failures (failures involving multiple machines in a small time window) are not negligible in the measured systems. This finding is contrary to the assumption typically made in analytical models that failures on different components are independent. Modeling of correlated failures has not been well addressed using data from real systems, and little is known so far about how to construct such models or which parameters specify the correlations. It is shown that even a small correlation can have a significant impact on availability, reliability, and transient performance. Traditional analytical models that assume failure independence overestimate dependability measures by orders of magnitude for the measured systems.
Further, the analysis shows that a few analytical models believed to take correlation into account do not reflect the actual process in which correlated failures occur. In an effort to study this issue, several new dependability models are introduced to evaluate systems with correlated failures. The models are validated with real data.
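
The sensitivity of dependability measures to a small correlation can be illustrated with a toy calculation. The sketch below is not one of the models introduced in this research and does not use the measured data. It assumes a hypothetical two-machine system that is down only when both machines are down, with each machine down with probability q and a correlation coefficient rho between the two down indicators. For two identically distributed indicator variables, the joint probability is q^2 + rho*q*(1 - q). The function name and the values of q and rho are illustrative assumptions.

```python
def joint_down_probability(q: float, rho: float) -> float:
    """Probability that two identical machines are down at the same time.

    q   -- probability that a single machine is down (hypothetical value)
    rho -- correlation coefficient between the two down indicators
           (0 means the independence assumption holds)
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must lie in [0, 1]")
    if not 0.0 <= rho <= 1.0:
        raise ValueError("this sketch only covers rho in [0, 1]")
    # Cov(X, Y) = rho * q * (1 - q) for identically distributed indicators,
    # and P(both down) = E[X * Y] = q * q + Cov(X, Y).
    return q * q + rho * q * (1.0 - q)


# Illustrative numbers only; they are not taken from the VAXcluster data.
q = 0.001
rho = 0.01

independent = joint_down_probability(q, 0.0)
correlated = joint_down_probability(q, rho)
ratio = correlated / independent

print(f"unavailability, independent failures: {independent:.3e}")
print(f"unavailability, rho = {rho}: {correlated:.3e}")
print(f"underestimate factor under independence: {ratio:.1f}")
```

Under these assumed numbers, a correlation coefficient of only 0.01 raises the system unavailability by roughly an order of magnitude relative to the independence assumption, because the correlation term rho*q*(1 - q) dominates q^2 when q is small.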