Paper information - Consistent state restoration in shared memory systems

Consistent state restoration in shared memory systems

In many systems, backward recovery is a classical technique for ensuring fault tolerance. It consists of restoring a computation to a consistent global state, saved in a global checkpoint, from which the computation can be resumed. A global checkpoint comprises a set of local checkpoints, one from each process, each corresponding to a local state saved to stable storage. In this paper we formally define the domino effect for shared memory systems, whether the shared memory is physical (as in multiprocessor systems) or virtual (as in distributed shared memory systems), and design a domino-free adaptive algorithm. These results rest on a necessary and sufficient condition that determines when a set of local checkpoints can belong to some consistent global checkpoint.

Achour Mostéfaoui | Michel Raynal | Roberto Baldoni | Jean-Michel Hélary