A faster checkpointing and recovery algorithm with a hierarchical storage approach

Fault tolerance is an essential part of a cluster operating system. The SCore cluster system provides coordinated checkpointing, a rollback recovery mechanism, and a watchdog timer detector for fault tolerance. In SCore's checkpointing algorithm, the disk write is the bottleneck. To eliminate the disk write overhead, this paper proposes a new diskless checkpointing and rollback recovery algorithm. Because the proposed algorithm needs neither to calculate parity nor to write the checkpoint data to disk, analysis shows it to be faster than the original checkpointing algorithm. A comparison shows that its recovery time is also shorter. However, with this diskless checkpointing algorithm alone the cluster cannot tolerate multiple transient failures. To compensate for this drawback, a hierarchical storage strategy is adopted. Experimental results show that the diskless algorithm combined with the hierarchical storage approach is fast and effective.
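The two-level idea can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the paper's algorithm: the abstract does not say where the in-memory checkpoint copies are kept, so the sketch assumes each node mirrors its checkpoint into the memory of its right-hand neighbour (one parity-free option) and that every `disk_every`-th checkpoint is also written to stable storage as the slower second level. The names `Node`, `Cluster`, `disk_every`, and the dictionary standing in for the disk are all hypothetical.

```python
import copy


class Node:
    def __init__(self, rank):
        self.rank = rank
        self.state = {"step": 0}
        self.alive = True
        self.local_ckpt = None   # this node's own checkpoint, kept in memory
        self.mirror_ckpt = None  # in-memory copy of the left neighbour's checkpoint


class Cluster:
    """Toy model: memory-level checkpoints plus an occasional disk-level copy."""

    def __init__(self, n, disk_every=3):
        self.nodes = [Node(r) for r in range(n)]
        self.disk = {}                # stand-in for stable storage (assumption)
        self.disk_every = disk_every  # hypothetical tuning parameter
        self.ckpt_count = 0

    def _right(self, node):
        return self.nodes[(node.rank + 1) % len(self.nodes)]

    def step(self):
        for node in self.nodes:
            node.state["step"] += 1

    def checkpoint(self):
        # Level 1: coordinated, memory-only checkpoint, mirrored without parity.
        for node in self.nodes:
            node.local_ckpt = copy.deepcopy(node.state)
        for node in self.nodes:
            self._right(node).mirror_ckpt = copy.deepcopy(node.local_ckpt)
        # Level 2: only every disk_every-th checkpoint pays the disk write.
        self.ckpt_count += 1
        if self.ckpt_count % self.disk_every == 0:
            self.disk = {x.rank: copy.deepcopy(x.local_ckpt) for x in self.nodes}

    def fail(self, rank):
        node = self.nodes[rank]
        node.alive = False
        node.state = node.local_ckpt = node.mirror_ckpt = None  # memory is lost

    def recover(self):
        failed = [x for x in self.nodes if not x.alive]
        # The memory level works only if every failed node's mirror survived.
        if all(self._right(f).alive for f in failed):
            level = "memory"
            images = {
                x.rank: x.local_ckpt if x.alive else self._right(x).mirror_ckpt
                for x in self.nodes
            }
        else:
            level = "disk"
            images = self.disk
        if not images or any(img is None for img in images.values()):
            raise RuntimeError("no usable checkpoint")
        # Roll every node back to the same checkpoint, then rebuild the mirrors.
        for x in self.nodes:
            x.alive = True
            x.state = copy.deepcopy(images[x.rank])
            x.local_ckpt = copy.deepcopy(images[x.rank])
        for x in self.nodes:
            self._right(x).mirror_ckpt = copy.deepcopy(x.local_ckpt)
        return level
```

In this sketch a single failure is repaired from a neighbour's memory and loses only the work since the latest checkpoint, whereas the simultaneous loss of a node and its mirror holder forces a rollback to the older disk-level checkpoint. That trade-off is what the second storage level is meant to cover.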
