Predicting Job Failures in AuverGrid Based on Workload Log Analysis

Grid systems are popular today because they can solve large problems in business and science. Job failures, which are inherent in any computational environment, are more common in grids because of their dynamic and complex nature. Furthermore, traditional methods of job failure recovery have proven costly, so such systems need to shift toward proactive and predictive management strategies. In this paper, we make a novel attempt to predict the outcome of jobs in a production grid environment. First, we investigated the relationship between workload characteristics and job failures by analyzing workload traces of AuverGrid, which is part of the EGEE (Enabling Grids for E-science) project. After identifying failure patterns, we predicted the success or failure status of jobs over 6 months of AuverGrid activity with approximately 96% accuracy. Integrating the results of this work into management services such as scheduling and monitoring can improve the quality of service on the grid.
