Software implementation of a recursive fault tolerance algorithm on a network of computers

RAFT is a recursive algorithm for fault tolerance that combines dynamic space and time redundancy techniques to detect faulty processors and recover from errors. U* is a multicomputer testbed consisting of a network of AT&T 3B2 computers running a network operating system based on the UNIX system. This paper describes a software implementation of RAFT on U* and demonstrates the effectiveness of a RAFT-like scheme for designing fault-tolerant multicomputer systems. Results of Monte Carlo experiments conducted on this system, which validated the theoretical basis of RAFT, are presented. The experimentally observed performance penalty incurred due to fault tolerance is also presented.