A survey on software checkpointing and mobility techniques in distributed systems

This paper has two purposes. First, it shows that existing solutions that employ checkpointing and mobility in distributed applications fix, at design time, the types and techniques of checkpointing and mobility to be used at runtime. Second, it provides policies that allow checkpointing and mobility techniques to be selected dynamically according to the execution environment. To that end, the paper presents and evaluates checkpointing and mobility techniques, identifies their advantages and shortcomings, and determines the execution conditions under which a specific mobility technique becomes beneficial. This investigation will assist the generation of adaptation plans and promote future research on the self-adaptivity of distributed applications. Copyright © 2011 John Wiley & Sons, Ltd.
