Fault-aware job scheduling for BlueGene/L systems

Summary form only given. Large-scale systems like BlueGene/L are susceptible to a range of software and hardware failures that can degrade system performance. We evaluate the effectiveness of a previously developed job-scheduling algorithm for BlueGene/L in the presence of faults, and we develop two new job-scheduling algorithms that take failures into account when scheduling jobs. We evaluate the impact of these algorithms on average bounded slowdown, average response time, and system utilization under different levels of the proactive failure prediction and prevention techniques reported in the literature. Our simulation studies show that the new algorithms can significantly improve the performance of the BlueGene/L system even when fault-prediction confidence or accuracy is very low (as low as 10%).
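The abstract does not define the three metrics it names. As an illustrative sketch only, the following Python computes them under the definitions commonly used in the parallel job-scheduling literature: response time is finish time minus arrival time, bounded slowdown is response time divided by the runtime (floored at a threshold tau) and never less than 1, and utilization is the fraction of node-seconds doing useful work over the observed span. The `Job` fields, the function names, and the 10-second default for `tau` are assumptions made for this example, not details taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Job:
    # Hypothetical record layout; all times are in seconds.
    arrival: float  # when the job was submitted
    start: float    # when it began running
    finish: float   # when it completed
    nodes: int      # number of nodes it occupied


def bounded_slowdown(job: Job, tau: float = 10.0) -> float:
    """Response time over runtime, with runtime floored at tau.

    The floor keeps very short jobs from dominating the average,
    and the outer max keeps the value from dropping below 1.
    """
    response = job.finish - job.arrival
    runtime = job.finish - job.start
    return max(response / max(runtime, tau), 1.0)


def summarize(jobs: list[Job], total_nodes: int, tau: float = 10.0):
    """Return (avg bounded slowdown, avg response time, utilization)."""
    n = len(jobs)
    avg_bsld = sum(bounded_slowdown(j, tau) for j in jobs) / n
    avg_resp = sum(j.finish - j.arrival for j in jobs) / n
    # Observed span: first arrival to last completion.
    span = max(j.finish for j in jobs) - min(j.arrival for j in jobs)
    # Node-seconds actually spent running jobs.
    busy = sum((j.finish - j.start) * j.nodes for j in jobs)
    utilization = busy / (total_nodes * span)
    return avg_bsld, avg_resp, utilization


if __name__ == "__main__":
    trace = [Job(0, 0, 100, 4), Job(0, 100, 200, 4)]
    print(summarize(trace, total_nodes=4))
```

In a fault-aware simulation, work lost to a failure would typically be excluded from the useful node-seconds and the restarted job's finish time pushed later, which is how failures would surface in all three numbers.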
