A Meta-Learning Failure Predictor for Blue Gene/L Systems

The demand for more computational power in science and engineering has spurred the design and deployment of ever-growing cluster systems. Even though the individual components used in these systems are highly reliable, the presence of a large number of components inevitably increases the failure probability of such systems. Successful prediction of potential failures can greatly enhance the fault tolerance mechanisms used in large clusters, thereby mitigating the adverse impact of failures on system productivity and total cost of ownership. In this paper, we present a three-phase failure predictor that automatically processes RAS events and discovers failure patterns for prediction in Blue Gene/L systems. In particular, this paper explores the use of meta-learning to adaptively integrate base methods with the goal of boosting prediction accuracy. Experiments with two RAS logs collected from Blue Gene/L systems at ANL and SDSC demonstrate the effectiveness of the proposed failure predictor.
