Selecting the Checkpoint Interval in Time Warp Parallel Simulation

In Time Warp parallel simulation, a process executes each message as soon as it arrives. If a message with a smaller timestamp arrives later, the process rolls back its state to the time of that earlier message and re-executes from that point. The state of each process must therefore be saved (checkpointed) regularly in case a rollback becomes necessary. Although most existing Time Warp implementations checkpoint after every state transition, this is not required; the checkpoint interval is in fact a tuning parameter of the simulation. Lin and Lazowska [7] proposed a model that derives the optimal checkpoint interval under the assumption that the rollback behavior of Time Warp is not affected by the frequency of checkpointing. An experimental study by Preiss et al. [11] indicates that rollback behavior is, in general, affected by the checkpointing frequency, so the Lin-Lazowska model may not reflect real situations. This paper extends the Lin-Lazowska model to include the effect of the checkpoint interval on rollback behavior. It describes the relationship among the checkpoint interval, the rollback behavior, and the overhead of state saving and restoration. It also proposes a checkpoint interval selection algorithm that determines the optimal checkpoint interval during the execution of a Time Warp simulation. Empirical results indicate that the algorithm converges quickly and always selects the optimal checkpoint interval.
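The trade-off behind the checkpoint interval can be illustrated with a minimal sketch. It is not the selection algorithm proposed in this paper; it only shows one process that saves its state every `chi` events and, on a late (straggler) message, restores the most recent usable checkpoint and re-executes the events in between (often called coasting forward). The class `LogicalProcess`, the helper `simulate`, the toy state update, and the counters are all hypothetical names introduced for illustration, and anti-messages and other Time Warp machinery are omitted.

```python
import copy


class LogicalProcess:
    """Toy single process with periodic checkpointing (illustrative only)."""

    def __init__(self, chi):
        self.chi = chi                      # checkpoint interval, in events
        self.state = {"sum": 0, "count": 0}
        self.processed = []                 # executed events, in timestamp order
        # Each checkpoint is (number of events applied, saved state copy).
        self.checkpoints = [(0, copy.deepcopy(self.state))]
        self.saves = 0                      # state-saving operations performed
        self.coast_forward = 0              # events re-executed only to rebuild state

    def _execute(self, event):
        _, value = event
        # Order-dependent update, so executing out of order gives a wrong state.
        self.state["count"] += 1
        self.state["sum"] += value * self.state["count"]
        self.processed.append(event)
        if len(self.processed) % self.chi == 0:
            self.checkpoints.append((len(self.processed), copy.deepcopy(self.state)))
            self.saves += 1

    def receive(self, event):
        if self.processed and event[0] < self.processed[-1][0]:
            self._rollback(event)           # straggler: smaller timestamp than the last event
        else:
            self._execute(event)

    def _rollback(self, straggler):
        ts = straggler[0]
        # k = number of already executed events that precede the straggler.
        k = sum(1 for e in self.processed if e[0] <= ts)
        # Discard checkpoints taken after the straggler's position.
        while self.checkpoints[-1][0] > k:
            self.checkpoints.pop()
        n, saved = self.checkpoints[-1]
        self.state = copy.deepcopy(saved)
        # Events n..k-1 are re-executed only because no checkpoint exists at k.
        self.coast_forward += k - n
        replay = self.processed[n:k] + [straggler] + self.processed[k:]
        self.processed = self.processed[:n]
        for e in replay:
            self._execute(e)


def simulate(chi, events):
    """Feed events (timestamp, value) to a process in arrival order."""
    lp = LogicalProcess(chi)
    for e in events:
        lp.receive(e)
    return lp
```

In this sketch a small `chi` increases `saves` but keeps `coast_forward` low, while a large `chi` does the opposite; the final state is the same in either case. Which interval minimizes total overhead depends on the relative cost of saving state and re-executing events, which is the question the model in this paper addresses.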