Critical event prediction for proactive management in large-scale computer clusters

As the complexity of distributed computing systems increases, systems management tasks require significantly higher levels of automation; examples include diagnosis and prediction based on real-time streams of computer events, setting alarms, and performing continuous monitoring. The core of autonomic computing, a recently proposed initiative towards next-generation IT-systems capable of 'self-healing', is the ability to analyze data in real-time and to predict potential problems. The goal is to avoid catastrophic failures through prompt execution of remedial actions.This paper describes an attempt to build a proactive prediction and control system for large clusters. We collected event logs containing various system reliability, availability and serviceability (RAS) events, and system activity reports (SARs) from a 350-node cluster system for a period of one year. The 'raw' system health measurements contain a great deal of redundant event data, which is either repetitive in nature or misaligned with respect to time. We applied a filtering technique and modeled the data into a set of primary and derived variables. These variables used probabilistic networks for establishing event correlations through prediction algorithms. We also evaluated the role of time-series methods, rule-based classification algorithms and Bayesian network models in event prediction.Based on historical data, our results suggest that it is feasible to predict system performance parameters (SARs) with a high degree of accuracy using time-series models. Rule-based classification techniques can be used to extract machine-event signatures to predict critical events with up to 70% accuracy.

[1]  T. Nakagawa,et al.  The Discrete Weibull Distribution , 1975, IEEE Transactions on Reliability.

[2]  Thomas G. Dietterich,et al.  Discovering Patterns in Sequences of Events , 1985, Artif. Intell..

[3]  J. Berger Statistical Decision Theory and Bayesian Analysis , 1988 .

[4]  Judea Pearl,et al.  Probabilistic reasoning in intelligent systems - networks of plausible inference , 1991, Morgan Kaufmann series in representation and reasoning.

[5]  Daniel P. Siewiorek,et al.  Error log analysis: statistical modeling and heuristic trend analysis , 1990 .

[6]  Ravishankar K. Iyer,et al.  Error/failure analysis using event logs from fault tolerant systems , 1991, [1991] Digest of Papers. Fault-Tolerant Computing: The Twenty-First International Symposium.

[7]  Ryszard S. Michalski,et al.  A theory and methodology of inductive learning , 1993 .

[8]  Daniel P. Siewiorek,et al.  VAX/VMS event monitoring and analysis , 1995, Twenty-Fifth International Symposium on Fault-Tolerant Computing. Digest of Papers.

[9]  Finn Verner Jensen,et al.  Introduction to Bayesian Networks , 2008, Innovations in Bayesian Networks.

[10]  Daniel P. Siewiorek,et al.  A comparative analysis of event tupling schemes , 1996, Proceedings of Annual Symposium on Fault Tolerant Computing.

[11]  Richard A. Davis,et al.  Introduction to time series and forecasting , 1998 .

[12]  Nir Friedman,et al.  Learning Belief Networks in the Presence of Missing Values and Hidden Variables , 1997, ICML.

[13]  Anoop Singhal,et al.  ANSWER: Network Monitoring Using Object-Oriented Rules , 1998, AAAI/IAAI.

[14]  Kevin P. Murphy,et al.  Learning the Structure of Dynamic Probabilistic Networks , 1998, UAI.

[15]  Gary M. Weiss,et al.  Implementing Design Patterns with Object-Oriented Rules , 1998, J. Object Oriented Program..

[16]  Haym Hirsh,et al.  Learning to Predict Rare Events in Event Sequences , 1998, KDD.

[17]  Gary M. Weiss Timeweaver: a genetic algorithm for identifying predictive patterns in sequences of events , 1999 .

[18]  Ricardo Vilalta,et al.  Predicting rare events in temporal domains , 2002, 2002 IEEE International Conference on Data Mining, 2002. Proceedings..

[19]  David F. Heidel,et al.  An Overview of the BlueGene/L Supercomputer , 2002, ACM/IEEE SC 2002 Conference (SC'02).

[20]  Peter A. Dinda,et al.  A prediction-based real-time scheduling advisor , 2002, Proceedings 16th International Parallel and Distributed Processing Symposium.

[21]  Joseph L. Hellerstein,et al.  Predictive algorithms in the management of computer systems , 2002, IBM Syst. J..

[22]  Ricardo Vilalta,et al.  A Classification Approach for Prediction of Target Events in Temporal Sequences , 2002, PKDD.