Failure prediction for autonomic management of networked computer systems with availability assurance

Networked computer systems continue to grow in scale and in the complexity of their components and interactions. In these environments, component failures are becoming the norm rather than the exception. Failure occurrences, along with their impact on system performance and operating costs, are an increasingly important concern for system designers and administrators. To achieve self-management of failures and resources in networked computer systems, we propose a framework for autonomic failure management with hierarchical failure prediction for large coalition systems, such as coalition clusters and compute grids. The framework analyzes node-level, cluster-level, and system-wide failure behaviors and forecasts prospective failure occurrences based on quantified failure dynamics. The predictor also inspects failure correlations. Experimental results from a campus computational grid show that our predictors' offline and online predictions accurately forecast the failure trend and capture failure correlations in the production environment.
