Performance under failures of high-end computing

Modern high-end computers are unprecedentedly complex. Faults are inevitable when running large-scale applications on future petaflop machines. Many methods have been proposed in recent years to mask faults. These methods, however, impose various performance and production costs. A better understanding of how faults influence application performance is necessary to use existing fault-tolerant methods wisely. In this study, we first introduce practical and effective performance models that predict application completion time under system failures. These models separate the influence of failure rate, failure repair, checkpointing period, checkpointing cost, and parallel task allocation on parallel and sequential execution times. To benefit the end users of a given computing platform, we then develop effective fault-aware task scheduling algorithms that optimize application performance under system failures. Finally, extensive simulations and experiments are conducted to evaluate our prediction models and scheduling strategies with actual failure traces.
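To make the interplay of failure rate, repair time, checkpointing period, and checkpointing cost concrete, the following is a minimal sketch of a sequential job under failures. It is not the models developed in this study. It assumes exponentially distributed times between failures, a fixed repair time, no failures during repair, and no checkpoint after the final work segment. It uses Young's well-known first-order approximation, sqrt(2 * checkpoint cost * MTBF), as the checkpoint period. All function and parameter names are hypothetical.

```python
import math
import random


def young_interval(checkpoint_cost, mtbf):
    """Young's first-order approximation of the optimum checkpoint interval."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def simulate_completion_time(work, interval, checkpoint_cost, mtbf,
                             repair_time, rng):
    """Simulate one run of a sequential job with periodic checkpoints.

    Assumptions (illustrative only): exponential time between failures,
    constant repair time, no failures while repairing, and a failure
    discards all progress since the last completed checkpoint.
    """
    clock = 0.0   # wall-clock time elapsed
    done = 0.0    # work protected by the latest checkpoint
    next_failure = rng.expovariate(1.0 / mtbf)
    while done < work:
        segment = min(interval, work - done)
        # The final segment finishes the job, so it needs no checkpoint.
        is_last = done + segment >= work
        cost = segment + (0.0 if is_last else checkpoint_cost)
        if clock + cost <= next_failure:
            # Segment (and its checkpoint) completed before the next failure.
            clock += cost
            done += segment
        else:
            # Failure: lose the segment, repair, restart from the checkpoint.
            clock = next_failure + repair_time
            next_failure = clock + rng.expovariate(1.0 / mtbf)
    return clock


def mean_completion_time(runs, seed, **params):
    """Average the simulated completion time over several independent runs."""
    rng = random.Random(seed)
    total = sum(simulate_completion_time(rng=rng, **params)
                for _ in range(runs))
    return total / runs
```

Calling `mean_completion_time` with several values of `interval` around `young_interval(checkpoint_cost, mtbf)` shows the trade-off: short periods pay more checkpointing overhead, while long periods lose more work per failure.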
