A two-level checkpoint algorithm in a highly-available parallel single level store system

A parallel single level store (PSLS) system integrates a shared virtual memory and a parallel file system. By managing data globally, it gives programmers of scientific applications the attractive shared-memory programming model combined with a large and efficient file system on a cluster. We present a cheap and efficient two-level checkpointing approach that enables a PSLS to tolerate failures. The first-level checkpointing algorithm is very efficient and saves data in memory, but it requires a large amount of memory space. When memory is saturated, an alternative algorithm that saves the checkpoint on disk is used instead. Performance results show the impact of different variants of the checkpointing algorithms.
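The fallback between the two levels can be illustrated with a minimal sketch. It is not the algorithm of the paper: the class `TwoLevelCheckpointer`, the byte budget `memory_capacity`, and the dictionary standing in for a disk are all hypothetical names chosen for illustration. It only shows the policy stated above, namely to save in memory while space remains and to fall back to disk once memory is saturated.

```python
from dataclasses import dataclass, field


@dataclass
class TwoLevelCheckpointer:
    """Toy two-level checkpoint store (illustrative assumption, not the paper's design)."""

    memory_capacity: int                               # hypothetical memory budget, in bytes
    memory_store: dict = field(default_factory=dict)   # level 1: in-memory checkpoints
    disk_store: dict = field(default_factory=dict)     # level 2: stands in for a disk

    def memory_used(self) -> int:
        # Total size of the checkpoint data currently held in memory.
        return sum(len(data) for data in self.memory_store.values())

    def save(self, page_id: int, data: bytes) -> str:
        # Drop any older copy of this page so that only one version is kept.
        self.memory_store.pop(page_id, None)
        self.disk_store.pop(page_id, None)
        # Level 1: keep the checkpoint in memory while the budget allows it.
        if self.memory_used() + len(data) <= self.memory_capacity:
            self.memory_store[page_id] = data
            return "memory"
        # Level 2: memory is saturated, so fall back to the disk store.
        self.disk_store[page_id] = data
        return "disk"

    def restore(self, page_id: int) -> bytes:
        # Look in memory first, then on disk.
        if page_id in self.memory_store:
            return self.memory_store[page_id]
        return self.disk_store[page_id]
```

With `memory_capacity=8`, two 4-byte pages fit in memory and a third one is routed to the disk store. A real system would also have to place the recovery data so that it survives a node failure, which this sketch does not attempt.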
