Simulation and analysis of highly reliable systems

In this thesis we investigate simulation algorithms and numerical approximations for estimating performance measures for some large classes of highly reliable systems. We start with the study of analytical approximations for a fundamental problem in reliability theory that is known to be computationally intractable (NP-hard): the standard network reliability problem with highly reliable independent components. We extrapolate these ideas to Markovian systems with highly reliable interacting components. The special structure of the generator matrix is used to derive limit theorems and efficient approximations for some special classes of such systems. For more general systems with complex interdependencies among components, we have to resort to simulation techniques. In particular, we investigate simulation techniques for a class of systems that are considered by the "SAVE" package. The SAVE ("Systems Availability Estimator") package is a state of the art software package being developed at the IBM T. J. Watson Research Center for the performance analysis of highly reliable systems. In the SAVE package the component failure times and component repair times are assumed to be exponentially distributed so that the systems may be modelled as continuous time Markov chains. Despite the Markovian structure, naive simulation is very inefficient due to the rarity of the failure events. An importance sampling technique called failure biasing has been known empirically to produce orders of magnitude of variance reduction in the simulation of some such systems. We modify this technique to make it both more robust and applicable to a broader class of systems. We develop a mathematical framework within which we can prove that our modified failure biasing technique yields a rate of convergence that is insensitive to the component failure rates. Another way of making a system highly reliable is to have large degrees of component redundancy. The components are permitted to have generally distributed repair times. We derive exact analytical formulas for the performance measures of some systems with components in parallel. We also investigate importance sampling techniques, based on large deviation theory, that can be used to efficiently simulate such systems.