Paper Information - Fault-Tolerance in Massively Parallel Systems

Fault-Tolerance in Massively Parallel Systems

In massively parallel systems (MPS), fault tolerance is indispensable for the proper completion of long-running, computation-intensive applications. To achieve this at reasonably low cost, we present a global approach. A flexible and powerful backbone is provided by combining hardware and software error detection techniques, fault diagnosis, and operator-site software with reconfiguration of the system. Application recovery is based on checkpointing and rollback. The guiding thread (i.e., applicability to a massively parallel system) comprises scalability as well as simplicity. A unifying system model is introduced that allows a global fault-tolerance concept to be mapped onto a wide variety of MPS. The framework for implementation in an existing MPS is discussed.
