System performance in a failure prone environment

This dissertation treats in some detail the behavior of systems in a failure-prone environment. Past work on this topic has centered primarily on queueing systems whose servers are subject to interruptions. Such queueing models assume that each customer brings an unstructured work requirement. We extend this work by considering customers with structured work requirements, which can, in some cases, achieve better performance in random environments than an unstructured job through the use of checkpointing. We show that this is the case for a simple server model and block-structured programs, and then show how the result extends to other program structures. We analyze a queueing system with structured work requirements, checkpointing, work loss, and server interruptions. Next, we extend past work by allowing significantly more complex server models that handle mixed types of interruptions and are non-Markovian in their transitions. We obtain completion time distribution transforms for these models under a wide variety of work requirement distributions. We also extend previous analyses of the M/G/1 queue to such servers. We then turn to concurrent programming environments. First, we consider scheduling when the processors are subject to failure, or to failure and repair, and the work requirement is structured as an in-tree. There we extend previous results by showing that HLF remains optimal in a failure-prone environment: it optimizes the makespan under extremely general failure and repair conditions. We then develop a model of the completion time distribution of a graph-structured work requirement when multiple processors operate under the PS scheduling discipline. Finally, we combine numerical inversion of the Laplace transform with numerical integration to recover distribution functions from the completion time and response time transforms developed in the dissertation.
In recovering the distribution functions, we show how several numerical techniques can be combined to reduce numerical error and obtain accurate results.
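
As a rough illustration of why checkpointing can help a block-structured job, the sketch below compares the expected completion time of a job run as one unstructured block with the same job split into equal checkpointed blocks. It is not the model analyzed in the dissertation. It assumes, purely for illustration, a deterministic work requirement, Poisson failures at a constant rate, instantaneous repair, and loss of all work done since the last checkpoint. Under those assumptions a block of length b takes (exp(rate * b) - 1) / rate time units on average. The function names and parameters (expected_completion_time, simulate_completion_time, checkpoint_cost) are hypothetical.

```python
import math
import random


def expected_completion_time(work, failure_rate, blocks=1, checkpoint_cost=0.0):
    """Expected time to finish `work` units split into `blocks` equal blocks.

    Illustrative assumptions (not taken from the dissertation): failures are
    Poisson with rate `failure_rate`, repair is instantaneous, and a failure
    discards all progress on the current block, including its checkpoint.
    """
    segment = work / blocks + checkpoint_cost
    # Expected time to get one failure-free run of length `segment`.
    per_block = math.expm1(failure_rate * segment) / failure_rate
    return blocks * per_block


def simulate_completion_time(work, failure_rate, blocks=1,
                             checkpoint_cost=0.0, rng=random):
    """Draw one sample of the completion time under the same assumptions."""
    segment = work / blocks + checkpoint_cost
    elapsed = 0.0
    for _ in range(blocks):
        while True:
            time_to_failure = rng.expovariate(failure_rate)
            if time_to_failure >= segment:
                # The block (and its checkpoint) finished before the next failure.
                elapsed += segment
                break
            # Failure mid-block: the time is spent, the block restarts.
            elapsed += time_to_failure
    return elapsed


if __name__ == "__main__":
    work, rate = 10.0, 0.2
    for n in (1, 2, 5, 10):
        print(n, round(expected_completion_time(work, rate, blocks=n), 3))
```

With work = 10 and failure_rate = 0.2, splitting the job into five blocks with free checkpoints lowers the expected completion time from roughly 31.9 to roughly 12.3 time units. A nonzero checkpoint_cost introduces a trade-off, so adding more blocks eventually stops paying off.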