A Study on the Optimal Heartbeat Interval for High Available Systems

Abstract A high-availability cluster system continues to provide service without interruption when any node in the cluster fails. Each node periodically sends a heartbeat signal to the other nodes in the system to indicate that it is still alive. Checkpoint and rollback schemes are used to reduce the loss of computation in the presence of failures. This paper analyzes the expected task execution cost as a function of the checkpoint interval and the heartbeat interval, and compares the resulting performance. From this analysis, we can choose the optimal heartbeat interval, as well as the optimal checkpoint interval, to minimize the cost of task execution.