Relaxing consistency in recoverable distributed shared memory

Relaxed memory consistency models tolerate increased memory access latency in both hardware and software distributed shared memory systems. In recoverable systems, relaxing consistency has the added benefit of reducing the number of checkpoints needed to avoid rollback propagation. The authors introduce new checkpointing algorithms that take advantage of relaxed consistency to reduce the performance overhead of checkpointing. They also introduce a scheme based on lazy relaxed consistency that reduces both checkpointing overhead and the overhead of avoiding error propagation in systems with error latency. They use multiprocessor address traces to evaluate the relaxed consistency approach to checkpointing with distributed shared memory.
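The claim that relaxed consistency reduces the number of checkpoints can be illustrated with a toy count over a short event trace. This is a minimal sketch under stated assumptions, not the authors' algorithms. It assumes a strict policy that checkpoints a node whenever uncheckpointed modifications are about to leave it, and a relaxed policy that defers this to the node's next release. The event names (`write`, `transfer`, `release`), the function names, and the trace are all hypothetical.

```python
# Toy illustration only: the two policies and the event names are assumptions
# made for this sketch, not the checkpointing algorithms from the paper.

def checkpoints_strict(trace):
    """Checkpoint whenever a node with uncheckpointed writes transfers data out."""
    dirty = set()   # nodes that have written since their last checkpoint
    count = 0
    for op, node in trace:
        if op == "write":
            dirty.add(node)
        elif op == "transfer" and node in dirty:
            count += 1          # checkpoint before the modified data leaves the node
            dirty.discard(node)
    return count

def checkpoints_relaxed(trace):
    """Checkpoint only at a release, where modifications must become visible."""
    dirty = set()
    count = 0
    for op, node in trace:
        if op == "write":
            dirty.add(node)
        elif op == "release" and node in dirty:
            count += 1          # one checkpoint covers all writes since the last one
            dirty.discard(node)
    return count

# Hypothetical trace: node 0 writes and exposes data three times, then releases.
trace = [
    ("write", 0), ("transfer", 0),
    ("write", 0), ("transfer", 0),
    ("write", 0), ("transfer", 0),
    ("release", 0),
]

strict_count = checkpoints_strict(trace)
relaxed_count = checkpoints_relaxed(trace)
print(strict_count, relaxed_count)
```

On this trace the strict policy takes three checkpoints and the relaxed policy takes one, because the relaxed policy batches all writes made before the release into a single checkpoint.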
