Fault-Driven Re-Scheduling For Improving System-level Fault Resilience

The productivity of HPC system is determined not only by their performance, but also by their reliability. The conventional method to limit the impact of failures is checkpointing. However, existing research shows that such a reactive fault tolerance approach can only improve system productivity marginally. Leveraging the recent progress made in the field of failure prediction, we propose fault-driven rescheduling (FARS) to improve system resilience to failures, and investigate the feasibility and effectiveness of utilizing failure prediction to dynamically adjust the placement of active jobs (e.g. running jobs) in response to failure prediction. In particular, a rescheduling algorithm is designed to enable effective job adjustment by evaluating performance impact of potential failures and rescheduling on user jobs. The proposed FARS complements existing research on fault-aware scheduling by allowing user jobs to avoid imminent failures at runtime. We evaluate FARS by using actual workloads and failure events collected from production HPC systems. Our preliminary results show the potential of FARS on improving system resilience to failures.

[1]  Ricardo Vilalta,et al.  Predicting rare events in temporal domains , 2002, 2002 IEEE International Conference on Data Mining, 2002. Proceedings..

[2]  Kang G. Shin,et al.  A Fault-Tolerant Scheduling Algorithm for Real-Time Periodic Tasks with Possible Software Faults , 2003, IEEE Trans. Computers.

[3]  Anand Sivasubramaniam,et al.  Fault-aware job scheduling for BlueGene/L systems , 2004, 18th International Parallel and Distributed Processing Symposium, 2004. Proceedings..

[4]  S. Scott,et al.  A Failure Predictive and Policy-Based High Availability Strategy for Linux High Performance Computing Cluster , 2004 .

[5]  Mark S. Squillante,et al.  Performance Implications of Failures in Large-Scale Cluster Scheduling , 2004, JSSPP.

[6]  Miroslaw Malek,et al.  Advanced Failure Prediction in Complex Software Systems , 2004 .

[7]  Atakan Dogan,et al.  Reliable matching and scheduling of precedence-constrained tasks in heterogeneous distributed computing , 2000, Proceedings 2000 International Conference on Parallel Processing.

[8]  J. P. Herzog,et al.  Application of a model-based fault detection system to nuclear plant signals , 1997 .

[9]  Jonathan M. Smith,et al.  A survey of process migration mechanisms , 1988, OPSR.

[10]  Charng-da Lu,et al.  Scalable Diskless Checkpointing for Large Parallel Systems , 2005 .

[11]  Bruce Allen,et al.  Monitoring hard disks with smart , 2004 .

[12]  Dror G. Feitelson,et al.  Utilization, Predictability, Workloads, and User Runtime Estimates in Scheduling the IBM SP2 with Backfilling , 2001, IEEE Trans. Parallel Distributed Syst..

[13]  J.-P. Wang,et al.  Task Allocation for Maximizing Reliability of Distributed Computer Systems , 1992, IEEE Trans. Computers.

[14]  Francine Berman,et al.  New Grid Scheduling and Rescheduling Methods in the GrADS Project , 2004, IPDPS Next Generation Software Program - NSFNGS - PI Workshop.

[15]  Anand Sivasubramaniam,et al.  Critical event prediction for proactive management in large-scale computer clusters , 2003, KDD '03.

[16]  Cong Du,et al.  MPI-Mitten: Enabling Migration Technology in MPI , 2006, Sixth IEEE International Symposium on Cluster Computing and the Grid (CCGRID'06).

[17]  Hong Jiang,et al.  A Dynamic and Reliability-Driven Scheduling Algorithm for Parallel Real-time Jobs on Heterogeneous Clusters , 2005 .

[18]  Zhiling Lan,et al.  Exploit failure prediction for adaptive fault-tolerance in cluster computing , 2006, Sixth IEEE International Symposium on Cluster Computing and the Grid (CCGRID'06).

[19]  E. N. Elnozahy,et al.  Checkpointing for peta-scale systems: a look into the future of practical rollback-recovery , 2004, IEEE Transactions on Dependable and Secure Computing.

[20]  Fabrizio Petrini,et al.  System-level fault-tolerance in large-scale parallel machines with buffered coscheduling , 2004, 18th International Parallel and Distributed Processing Symposium, 2004. Proceedings..

[21]  Richard Wolski,et al.  Automatic methods for predicting machine availability in desktop Grid and peer-to-peer systems , 2004, IEEE International Symposium on Cluster Computing and the Grid, 2004. CCGrid 2004..

[22]  Niraj K. Jha,et al.  Safety and Reliability Driven Task Allocation in Distributed Systems , 1999, IEEE Trans. Parallel Distributed Syst..

[23]  Xiao Qin,et al.  A dynamic and reliability-driven scheduling algorithm for parallel real-time jobs executing on heterogeneous clusters , 2005, J. Parallel Distributed Comput..

[24]  Wednesday September,et al.  2007 International Conference on Parallel Processing , 2007 .

[25]  Miron Livny,et al.  Checkpoint and Migration of UNIX Processes in the Condor Distributed Processing System , 1997 .