ChkSim: A Distributed Checkpointing Simulator

ChkSim is a portable tool for simulating the execution of checkpointing algorithms in distributed applications. It gives system and algorithm designers quantitative data with which to compare these algorithms. This report describes the ChkSim simulation model, its software architecture, and its user manual.
