Benefits of Software Rejuvenation on HPC Systems

Rejuvenation is a technique expected to mitigate failures in HPC systems by replacing, repairing, or resetting system components. Because of the small overhead required by software rejuvenation, we primarily focus on OS/kernel rejuvenation. In this paper, we propose three rejuvenation scheduling techniques. Moreover, we investigate the claim that software rejuvenation prolongs failure times in HPC systems. Also, we compare the lost computing times of the checkpoint/restart mechanism with and without rejuvenation after each checkpoint.

[1]  Matteo Sereno,et al.  Fine Grained Software Degradation Models for Optimal Rejuvenation Policies , 2001, Perform. Evaluation.

[2]  Kishor S. Trivedi,et al.  A comprehensive model for software rejuvenation , 2005, IEEE Transactions on Dependable and Secure Computing.

[3]  Kishor S. Trivedi,et al.  An approach for estimation of software aging in a Web server , 2002, Proceedings International Symposium on Empirical Software Engineering.

[4]  Stephen L. Scott,et al.  Reliability of a System of k Nodes for High Performance Computing Applications , 2010, IEEE Transactions on Reliability.

[5]  Zhang Hao,et al.  Performance analysis of software rejuvenation , 2003, Proceedings of the Fourth International Conference on Parallel and Distributed Computing, Applications and Technologies.

[6]  Kishor S. Trivedi,et al.  A measurement-based model for estimation of resource exhaustion in operational software systems , 1999, Proceedings 10th International Symposium on Software Reliability Engineering (Cat. No.PR00443).

[7]  Kishor S. Trivedi,et al.  A methodology for detection and estimation of software aging , 1998, Proceedings Ninth International Symposium on Software Reliability Engineering (Cat. No.98TB100257).

[8]  S TrivediKishor,et al.  A Comprehensive Model for Software Rejuvenation , 2005 .

[9]  Kenny C. Gross,et al.  Advanced pattern recognition for detection of complex software aging phenomena in online transaction processing servers , 2002, Proceedings International Conference on Dependable Systems and Networks.

[10]  Long Zhao,et al.  Availability and Cost Analysis of a Fault-Tolerant Software System with Rejuvenation , 2008, 2008 International Conference on Advanced Computer Theory and Engineering.

[11]  Yennun Huang,et al.  Software rejuvenation: analysis, module and applications , 1995, Twenty-Fifth International Symposium on Fault-Tolerant Computing. Digest of Papers.