A symmetric O(n log n) message distributed snapshot algorithm for large-scale systems

This paper presents an O(n log n)-message distributed snapshot algorithm for systems with non-FIFO channels, where n is the number of processors. The algorithm is applicable to checkpointing in large-scale supercomputers and distributed systems that have a fully connected logical topology over a large number of processors. Each processor sends log n messages. The message sizes are geometrically distributed, and the sizes of the messages sent by any one processor sum to n. The response time of the algorithm is O(log n). The algorithm is fully distributed and every processor plays a symmetric role, unlike in tree-based, ring-based, and centralized algorithms.
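The cost profile stated in the abstract (log n messages per processor, geometrically growing sizes, O(log n) rounds, symmetric roles) can be illustrated with a recursive-doubling exchange on a hypercube. The abstract does not describe the paper's actual protocol, so the sketch below is only an assumed stand-in with the same message counts. The function name `recursive_doubling_allgather` is hypothetical, n is assumed to be a power of two, and non-FIFO channel handling is not modeled.

```python
def recursive_doubling_allgather(n):
    """Simulate a hypercube all-gather among n processors.

    Illustrative assumption only: this is a generic recursive-doubling
    exchange, not the algorithm from the paper.
    Returns (known, sent_sizes), where known[p] is the set of processor
    ids whose local state p holds at the end, and sent_sizes[p] lists
    the size of each message p sent, one per round.
    """
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    known = [{p} for p in range(n)]        # each processor starts with its own state
    sent_sizes = [[] for _ in range(n)]
    step = 1
    while step < n:                        # log2(n) rounds in total
        before = [set(s) for s in known]   # all sends in a round are concurrent
        for p in range(n):
            partner = p ^ step             # neighbor along one hypercube dimension
            sent_sizes[p].append(len(before[p]))
            known[partner] |= before[p]
        step <<= 1
    return known, sent_sizes


known, sent_sizes = recursive_doubling_allgather(8)
total_messages = sum(len(sizes) for sizes in sent_sizes)
```

For n = 8, every processor sends 3 messages of sizes 1, 2, and 4, so the system sends 8 × 3 = 24 messages in 3 rounds. The per-processor volume here is n − 1 = 7 state entries, which is close to but not exactly the total of n given in the abstract.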
