An Extended Atomic Consistency Protocol for Recoverable DSM Systems

This paper describes a new checkpoint recovery protocol for Distributed Shared Memory (DSM) systems with read-write objects. The protocol is based on independent checkpointing integrated with a coherence protocol for the atomic consistency model. It offers high availability of shared objects despite multiple node and communication failures, while introducing little overhead. It ensures fast recovery after multiple node failures and enables a DSM system to tolerate network partitioning, as long as a majority partition can be formed. A formal proof of the protocol's correctness is also presented.
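The majority-partition condition mentioned above can be illustrated with a minimal sketch. It is not the paper's protocol: the function name `has_majority` and the node identifiers are hypothetical, and the sketch assumes that system membership is fixed and known to every node, and that "majority" means a strict majority of all nodes.

```python
def has_majority(reachable: set[str], all_nodes: set[str]) -> bool:
    """Return True if the reachable nodes form a strict majority of the system.

    Hypothetical helper for illustration only; assumes static, known membership.
    """
    # Count only nodes that actually belong to the system.
    members = reachable & all_nodes
    # A strict majority means more than half of all nodes.
    return len(members) > len(all_nodes) // 2


if __name__ == "__main__":
    nodes = {"n1", "n2", "n3", "n4", "n5"}
    # A partition of three out of five nodes may keep serving shared objects.
    print(has_majority({"n1", "n2", "n3"}, nodes))
    # A partition of two out of five nodes may not.
    print(has_majority({"n4", "n5"}, nodes))
```

Under this assumption at most one partition can hold a strict majority at any time, which is why a majority rule is a common way to keep replicated state consistent across a network split.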
