Job Failure Analysis and Its Implications in a Large-Scale Production Grid

In this paper we present an initial analysis of job failures in a large-scale data-intensive Grid. Based on three representative periods in production, we characterize the interarrival times and life spans of failed jobs. Different failure types are distinguished, and the analysis is carried further to the Virtual Organization (VO) level. The spatial behavior, namely where job failures occur in the Grid, is also examined. Cross-correlation structures, including how arrivals correlate with the life spans of failed jobs, are analyzed and illustrated. We further investigate statistical models to fit the failure data and propose several failure-aware scheduling strategies at the Grid level. Our results show that the overall failure rates in the Grid are significant, ranging from 25% to 33% of all submitted jobs. However, only 5% to 8% of the jobs failed after running on a Computing Element (CE). The rest of the failed jobs were aborted or cancelled without running. A majority of failed jobs come from several large production VOs, and a large number of these failures are concentrated on several main CEs. The interarrival time processes of failed jobs are shown to be bursty, and the life spans exhibit strong autocorrelations. Based on these failure patterns, we argue that it is important for Grid resource brokers to track historical failures and take them into account in decision making. Some proactive measures and accountability issues are also discussed.
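
As a rough illustration of the quantities named above, the following minimal sketch computes interarrival times from failure timestamps, a coefficient of variation as one simple burstiness indicator, and a sample autocorrelation at a given lag. It is not the analysis method used in the paper: the function names, the choice of the coefficient of variation as a burstiness measure, and the sample data are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def interarrival_times(timestamps):
    """Gaps between consecutive failure events (timestamps in seconds)."""
    ts = sorted(timestamps)
    return [later - earlier for earlier, later in zip(ts, ts[1:])]


def coefficient_of_variation(values):
    """Standard deviation divided by the mean.

    Assumption: values well above 1 are read here as a hint of burstiness
    (a Poisson process has exponential gaps with a value of 1).
    """
    m = mean(values)
    if m == 0:
        raise ValueError("mean is zero; coefficient of variation undefined")
    return pstdev(values) / m


def autocorrelation(values, lag):
    """Sample autocorrelation of a series at the given positive lag."""
    n = len(values)
    if not 0 < lag < n:
        raise ValueError("lag must satisfy 0 < lag < len(values)")
    m = mean(values)
    variance_sum = sum((v - m) ** 2 for v in values)
    if variance_sum == 0:
        raise ValueError("series is constant; autocorrelation undefined")
    covariance_sum = sum(
        (values[i] - m) * (values[i + lag] - m) for i in range(n - lag)
    )
    return covariance_sum / variance_sum


if __name__ == "__main__":
    # Hypothetical failure timestamps (seconds): two tight clusters.
    failure_times = [0, 1, 2, 3, 100, 101, 102, 103]
    gaps = interarrival_times(failure_times)
    print("interarrival times:", gaps)
    print("coefficient of variation:", coefficient_of_variation(gaps))

    # Hypothetical life spans (seconds) of consecutive failed jobs.
    life_spans = [10, 12, 11, 50, 52, 51, 10, 11]
    print("lag-1 autocorrelation:", autocorrelation(life_spans, 1))
```

A resource broker that tracks historical failures could, under the same assumptions, maintain such statistics per CE or per VO and consult them when ranking candidate resources.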
