Paper information - Using Cloud Constructs and Predictive Analysis to Enable Pre-Failure Process Migration in HPC Systems

Accurate failure prediction, combined with efficient process migration facilities that include some Cloud constructs, can enable failure avoidance in large-scale high-performance computing (HPC) platforms. In this work, we demonstrate a prototype system that integrates our probabilistic failure prediction system with virtualization mechanisms and techniques to provide a whole-system approach to failure avoidance. The demonstration uses a failure scenario based on a real-world HPC case study.
