Failure Prediction Mechanisms in Cluster Systems

Clustering is an important technique for improving the performance and availability of computer systems. The use of cluster systems continues to grow because they offer scalability, high availability, and high-performance computing. Availability is mainly governed by failure detection and recovery mechanisms, including proactive mechanisms that try to prevent faults from occurring. Given the importance of availability for high-performance computing, this paper surveys notable existing mechanisms for preventing faults in high-availability and high-performance computing cluster systems and presents a comparative overview.
