Processor failure recovery for a resource sharing algorithm

As distributed computer systems grow in popularity, the reliability of the system as a whole becomes more important. A recently published combined resource sharing algorithm showed how to provide the atomic operations required for resource management in a closely coupled multiprocessor system. This paper describes a recovery mechanism that can be incorporated into that earlier algorithm so that the system continues to operate correctly despite the failure of one or more component processors. A distributed simulation of the recovery mechanism is described, and results from simulation runs are presented.