Load sharing in distributed systems with failures

Summary: An approximate model is presented for the mean response time in a distributed computer system in which components may fail. Each node in the system periodically performs a checkpoint, and also periodically tests the other nodes to determine whether they have failed. When a node fails, it distributes its workload to other nodes which appear to be operational, based on the results of its most recent test. The model explicitly allows for the delays caused by transactions being incorrectly transferred to failed nodes because of out-of-date test results. For the case when all nodes are identical, a closed-form solution is derived for the testing rate that minimizes the average response time. Numerical results are presented illustrating the relationships among the problem parameters.