Rollback Recovery in Multiprocessor Ring Configurations

This paper describes a technique for distributed recovery in multiprocessor ring configurations, which has been developed and implemented for the multiprocessor system DIRMU 25 — a 25 processor system which is operational at the University of Erlangen-Nuremberg. First a short overview of the DIRMU hardware architecture and the distributed operating system DIRMOS is given. The steps of distributed recovery using distributed system checkpoints are described. By measurement of the runtime overhead of a realistic application (2D-Poisson-multigrid) its efficiency is discussed in comparasion to recovery techniques using central system checkpoints.