Fault tolerance via N-modular software redundancy

This paper presents a novel method of "indirect" software instrumentation for achieving fault tolerance at the application level. Error detection and recovery are based on the well-known approach of replicating application processes on multiple computers in a network. The advantages of this fault tolerance scheme include: (1) a general error detection method that ensures the integrity of critical data without any modification of the application code, (2) a high degree of automation and transparency in fault-tolerant configuration and operation (for example, the set-up time for a new application is on the order of a few minutes), and (3) the ability to perform error detection for applications for which no source code, or only minimal knowledge of the code, is available, including legacy applications. The faults tolerated include transient and permanent hardware faults on a single machine, as well as certain types of application and operating system software faults.
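The replication-based error detection summarized above can be illustrated with a generic majority voter. The sketch below is not the method of this paper, whose instrumentation and recovery mechanisms are not described in the abstract; it only assumes that each of N replicas reports the value of one critical data item and that a strict majority decides. The function name `majority_vote` and its return convention are hypothetical.

```python
from collections import Counter
from typing import Hashable, List, Optional, Sequence, Tuple


def majority_vote(values: Sequence[Hashable]) -> Tuple[Optional[Hashable], List[int]]:
    """Compare one critical data value reported by each replica.

    Returns (agreed value, indices of replicas that disagree with it).
    The agreed value is None when no value is held by a strict majority,
    meaning an error was detected but cannot be masked by voting.
    Hypothetical helper for illustration only.
    """
    if not values:
        raise ValueError("at least one replica value is required")

    # Find the most frequently reported value and how often it occurs.
    winner, count = Counter(values).most_common(1)[0]

    # Without a strict majority, every replica is suspect.
    if count * 2 <= len(values):
        return None, list(range(len(values)))

    # Replicas whose value differs from the majority are flagged as faulty.
    faulty = [i for i, v in enumerate(values) if v != winner]
    return winner, faulty
```

With three replicas, `majority_vote([42, 42, 17])` yields the agreed value 42 and flags replica index 2, which corresponds to masking a fault confined to a single machine. If all three replicas disagree, the voter returns no agreed value, so the error is detected but recovery must come from some other mechanism.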