Paper information - A two-level checkpoint algorithm in a highly-available parallel single level store system

A parallel single level store system (PSLS) integrates a shared virtual memory and a parallel file system. By managing data globally, it offers programmers of scientific applications the attractive shared-memory programming model combined with a large and efficient file system on a cluster. We present a low-cost and efficient two-level checkpointing approach that enables a PSLS to tolerate failures. The first-level checkpointing algorithm is very efficient and saves data in memory, but it requires a large amount of memory space. When memory is saturated, an alternative algorithm that saves the checkpoint on disk is used instead. Performance results show the impact of different variants of the checkpointing algorithms.

Anne-Marie Kermarrec | Christine Morin | Renaud Lottiaux
