Fault-Tolerance in Massively Parallel Systems

In massively parallel systems (MPS), fault tolerance is indispensable for the proper completion of long-running, computation-intensive applications. To achieve this at reasonably low cost, we present a global approach. A flexible and powerful backbone is provided by combining hardware and software error detection techniques, fault diagnosis, and operator-site software with reconfiguration of the system. Application recovery is based on checkpointing and rollback. The common thread (i.e., applicability to a massively parallel system) comprises scalability as well as simplicity. A unifying system model is introduced that allows a global fault-tolerance concept to be mapped onto a wide variety of MPS. The framework for implementation in an existing MPS is discussed.