Efficient checkpointing-recovery schemes for distributed systems

In a distributed system, checkpointing and failure recovery of one process may affect other processes because of the inter-process dependencies created by message communication. Most checkpointing-recovery schemes in the literature try to improve process performance by reducing either the effects of checkpointing or the effects of failures. However, reducing the effects of checkpointing inevitably increases the effects of failures, and vice versa. A new checkpointing-recovery scheme is proposed for distributed systems, in which the time interval between checkpoints is adjusted dynamically, taking into account the overhead of both checkpointing and failure recovery. By properly adjusting its checkpointing interval, each process also adjusts the effects of its checkpoints and failures on other processes, so that the overall performance of the processes in the system improves over existing schemes. Moreover, each process makes these adjustments using only its local information, with no extra communication overhead. The proposed scheme can easily be incorporated into any existing scheme. The performance of processes under the proposed scheme is compared with that under existing schemes through an extensive simulation study. The simulation results show a substantial improvement in overall process performance and indicate that performance under the proposed scheme is more stable as system parameters vary.
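
The trade-off between checkpointing overhead and failure overhead can be illustrated with a small sketch. It is not the proposed scheme, whose adjustment rule is not given in this abstract. It swaps in the classical first-order approximation for a single process (often attributed to Young), which balances a per-checkpoint cost against the work expected to be lost per failure. The names `young_interval`, `overhead_fraction`, and `AdaptiveInterval`, the exponential smoothing, and the parameter `alpha` are all assumptions made for illustration.

```python
import math


def young_interval(checkpoint_cost: float, failure_rate: float) -> float:
    """First-order approximation of the interval that minimizes overhead.

    Assumes a single process, a constant failure rate, and a checkpoint
    cost that is small relative to the mean time between failures.
    """
    return math.sqrt(2.0 * checkpoint_cost / failure_rate)


def overhead_fraction(interval: float, checkpoint_cost: float,
                      failure_rate: float) -> float:
    """Approximate fraction of time lost to checkpoints plus rework.

    The first term is the checkpoint cost paid once per interval; the
    second is the expected rework, assuming a failure loses half an
    interval of work on average.
    """
    return checkpoint_cost / interval + failure_rate * interval / 2.0


class AdaptiveInterval:
    """Hypothetical per-process estimator that uses only local observations."""

    def __init__(self, checkpoint_cost: float, failure_rate: float,
                 alpha: float = 0.2) -> None:
        self.checkpoint_cost = checkpoint_cost
        self.failure_rate = failure_rate
        self.alpha = alpha  # smoothing weight given to the newest sample

    def observe_checkpoint(self, measured_cost: float) -> None:
        """Blend a locally measured checkpoint cost into the estimate."""
        self.checkpoint_cost = (self.alpha * measured_cost
                                + (1.0 - self.alpha) * self.checkpoint_cost)

    def observe_failure(self, time_since_last_failure: float) -> None:
        """Blend one observed inter-failure time into the rate estimate."""
        sample_rate = 1.0 / time_since_last_failure
        self.failure_rate = (self.alpha * sample_rate
                             + (1.0 - self.alpha) * self.failure_rate)

    def next_interval(self) -> float:
        """Recompute the interval from the current local estimates."""
        return young_interval(self.checkpoint_cost, self.failure_rate)
```

In this sketch, a process that observes more expensive checkpoints lengthens its interval, and one that observes more frequent failures shortens it. Both decisions rely only on locally measured quantities, which mirrors the local-information property claimed above, although the effects on other processes that the proposed scheme accounts for are not modeled here.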