A Scalable Implementation of Fault Tolerance for Massively Parallel Systems

This project is partly sponsored by ESPRIT project 6731 (FTMPS): "Fault Tolerance in Massively Parallel Systems". Geert Deconinck and Johan Vounckx have a grant from the Flemish Institute for the Advancement of Scientific and Technological Research in Industry (IWT). Rudy Lauwereins is a Senior Research Associate of the Belgian Fund for Scientific Research.

[1]  Rudy Lauwereins,et al.  The FTMPS-Project: Design and Implementation of Fault-Tolerance Techniques for Massively Parallel Systems , 1994, HPCN.

[2]  Jan van Leeuwen,et al.  Interval Routing , 1987, The Computer Journal.

[3]  Rudy Lauwereins,et al.  Minimal Deadlock-Free Compact Routing in Wormhole Switching based Injured Meshes , 1995 .

[4]  Rudy Lauwereins,et al.  A Loader for Injured Massively Parallel Networks , 1995 .

[5]  Rudy Lauwereins,et al.  The Consistent File-Status in a User-Triggered Checkpointing Approach , 1995, PARCO.

[6]  Jörn Altmann,et al.  On integrating error detection into a fault diagnosis algorithm for massively parallel computers , 1995, Proceedings of 1995 IEEE International Computer Performance and Dependability Symposium.

[7]  Jörn Altmann,et al.  An Event-Driven Approach to Multiprocessor Diagnosis , 1994 .

[8]  Henrique Madeira,et al.  Xception: Software Fault Injection and Monitoring in Processor Functional Units , 1995 .

[9]  Johan Vounckx,et al.  Reconfiguration and Checkpointing in Massively Parallel Systems , 1994, EDCC.

[10]  Rudy Lauwereins,et al.  A User-triggered Checkpointing Library for Computation-intensive Applications , 1995, Parallel and Distributed Computing and Systems.

[11]  Yuval Tamir,et al.  Error Recovery in Multicomputers Using Global Checkpoints , 1984 .

[12]  Jörn Altmann,et al.  An Approach for Hierarchical System Level Diagnosis of Massively Parallel Computers Combined with a Simulation-Based Method for Dependability Analysis , 1994, EDCC.