Dynamic snooping in a fault-tolerant distributed shared memory

Distributed shared memory (DSM) allows multicomputer systems with no physically shared memory to be programmed using a shared memory paradigm. However, as the number of nodes in a system increases, so does the probability of a failure that can corrupt the DSM. This paper presents a fault-tolerant DSM (FTDSM) algorithm that can tolerate single node failures. Each page in the DSM is assigned a snooper that keeps a backup copy of the page and can take over if the page owner fails. The snooper is dynamic because the responsibility for snooping a page can migrate from node to node. The FTDSM presented is an improvement over other FTDSMs because it is scalable, is based on the efficient dynamic distributed manager (DDM) DSM algorithm, does not require the repair of a failed processor to access the DSM, and does not query all nodes to rebuild the state of the DSM. It is shown that any single node failure can be tolerated because either the owner or the snooper of a page can always be found.
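The owner/snooper relationship described above can be illustrated with a small sketch. It is not the paper's algorithm: the abstract does not give the protocol, so the class names (`Page`, `Cluster`), the method names, and the rule for choosing a replacement snooper (lowest-numbered live node) are all assumptions made for illustration. The sketch also keeps one global page table for brevity, whereas the FTDSM is built on the dynamic distributed manager scheme and does not rely on such a central view. It only demonstrates the invariant that, after any single node failure, each page still has a live owner and a distinct live snooper.

```python
from dataclasses import dataclass


@dataclass
class Page:
    owner: int         # node holding the primary copy of the page
    snooper: int       # node holding the backup copy of the page
    data: bytes = b""


class Cluster:
    """Toy model of page ownership; a hypothetical, centralized simplification."""

    def __init__(self, node_ids):
        self.live = set(node_ids)
        self.pages = {}

    def add_page(self, page_id, owner, snooper, data=b""):
        if owner == snooper:
            raise ValueError("owner and snooper must be different nodes")
        self.pages[page_id] = Page(owner, snooper, data)

    def _pick_snooper(self, exclude):
        # Assumed policy: lowest-numbered live node other than the owner.
        candidates = sorted(self.live - {exclude})
        if not candidates:
            raise RuntimeError("no live node left to act as snooper")
        return candidates[0]

    def migrate_snooper(self, page_id, new_snooper):
        # Snooping is dynamic: the responsibility can move between nodes.
        page = self.pages[page_id]
        if new_snooper not in self.live or new_snooper == page.owner:
            raise ValueError("snooper must be a live node other than the owner")
        page.snooper = new_snooper

    def fail_node(self, node):
        # Single-failure recovery: promote the snooper if the owner failed,
        # then make sure every affected page has a fresh snooper.
        self.live.discard(node)
        for page in self.pages.values():
            if page.owner == node:
                page.owner = page.snooper
                page.snooper = self._pick_snooper(exclude=page.owner)
            elif page.snooper == node:
                page.snooper = self._pick_snooper(exclude=page.owner)
```

For example, with nodes 0, 1 and 2 and a page owned by node 0 and snooped by node 1, calling `fail_node(0)` promotes node 1 to owner and assigns node 2 as the new snooper, without consulting or repairing the failed node.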
