As system complexity increases, the probability of failure rises substantially, so such systems need techniques that support fault tolerance. Checkpointing is widely used to reduce the execution time of long-running programs in the presence of failures and to enhance the reliability of these systems. Several methods have been studied for determining the checkpointing interval that optimizes system performance. The crucial parameter in all of these solutions is the system failure model, which is usually assumed to follow an exponential or Weibull distribution. These models are not fully accurate, however, because they fail to capture the effect of soft errors. In this paper, we introduce a more realistic failure model based on the processor's architectural vulnerability factor (AVF). In addition, we propose three checkpoint placement methods, with constant and variable intervals, that determine suitable checkpoint locations under the proposed failure model. Our experimental results show that these methods, which can be implemented on any multicore system, find suitable points at which checkpoints should be taken.
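For context, the exponential failure model mentioned above admits a well-known first-order approximation of the optimal checkpoint interval, commonly attributed to Young: the interval is roughly the square root of twice the checkpoint cost times the mean time between failures. The sketch below illustrates that classical baseline together with the Weibull hazard rate; it is not the AVF-based model or any of the three placement methods proposed in this paper, and the function names and numeric values are assumptions chosen for illustration only.

```python
import math


def young_interval(checkpoint_cost: float, mtbf: float) -> float:
    """First-order optimal checkpoint interval under exponential failures.

    checkpoint_cost: time to write one checkpoint (seconds, assumed value).
    mtbf: mean time between failures (seconds, assumed value).
    """
    if checkpoint_cost <= 0 or mtbf <= 0:
        raise ValueError("checkpoint_cost and mtbf must be positive")
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def weibull_hazard(t: float, shape: float, scale: float) -> float:
    """Instantaneous failure rate of a Weibull distribution at time t > 0.

    With shape == 1 this reduces to the constant rate 1 / scale of the
    exponential model; shape < 1 gives a decreasing rate, shape > 1 an
    increasing one.
    """
    if t <= 0 or shape <= 0 or scale <= 0:
        raise ValueError("t, shape and scale must be positive")
    return (shape / scale) * (t / scale) ** (shape - 1.0)


if __name__ == "__main__":
    # Hypothetical numbers: 60 s per checkpoint, one failure per day on average.
    interval = young_interval(checkpoint_cost=60.0, mtbf=86400.0)
    print(f"checkpoint roughly every {interval:.0f} s")
    print(f"hazard at t=10 s (exponential case): {weibull_hazard(10.0, 1.0, 100.0)}")
```

Both functions take the failure rate as a fixed property of the system over time. Soft errors, whose impact depends on what the processor is executing, are not represented in either, which is the gap the AVF-based failure model is meant to address.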