Using Cloud Constructs and Predictive Analysis to Enable Pre-Failure Process Migration in HPC Systems
Vincent De Sapio | Ann C. Gentile | Diana C. Roe | David C. Thompson | Jackson Mayo | Philippe P. Pébay | Jim M. Brandt | Matthew Wong | Frank Chen
[1] Jon Stearley, et al. Bad Words: Finding Faults in Spirit's Syslogs, 2008, 2008 Eighth IEEE International Symposium on Cluster Computing and the Grid (CCGRID).
[2] Bianca Schroeder, et al. Understanding failures in petascale computers, 2007.
[3] Seetharami R. Seelam, et al. Modeling the Impact of Checkpoints on Next-Generation Systems, 2007, 24th IEEE Conference on Mass Storage Systems and Technologies (MSST 2007).
[4] Laxmikant V. Kalé, et al. FTC-Charm++: an in-memory checkpoint-based fault tolerant runtime for Charm++ and MPI, 2004, 2004 IEEE International Conference on Cluster Computing (IEEE Cat. No.04EX935).
[5] Sebastien Goasguen, et al. A study of a KVM-based cluster for grid computing, 2009, ACM-SE 47.
[6] Christian Engelmann, et al. A diskless checkpointing algorithm for super-scale architectures applied to the fast Fourier transform, 2003, Proceedings of the International Workshop on Challenges of Large Applications in Distributed Environments, 2003.
[7] S. Scott, et al. Reliability Analysis in HPC clusters, 2006.
[8] Bianca Schroeder, et al. Disk Failures in the Real World: What Does an MTTF of 1,000,000 Hours Mean to You?, 2007, FAST.
[9] Anand Sivasubramaniam, et al. Critical event prediction for proactive management in large-scale computer clusters, 2003, KDD '03.
[10] Chao Wang, et al. A Job Pause Service under LAM/MPI+BLCR for Transparent Fault Tolerance, 2007, 2007 IEEE International Parallel and Distributed Processing Symposium.
[11] Bert J. Debusschere, et al. Ovis-2: A robust distributed architecture for scalable RAS, 2008, 2008 IEEE International Symposium on Parallel and Distributed Processing.
[12] Anthony M. Filippi, et al. Effects of virtualization on a scientific application running a hyperspectral radiative transfer code on virtual machines, 2008, HPCVirt '08.
[13] Christian Engelmann, et al. Proactive fault tolerance for HPC with Xen virtualization, 2007, ICS '07.
[14] Jackson Mayo, et al. Methodologies for advance warning of compute cluster problems via statistical analysis: a case study, 2009, Resilience '09.
[15] Eduardo Pinheiro, et al. DRAM errors in the wild: a large-scale field study, 2009, SIGMETRICS '09.
[16] Larry Rudolph, et al. Cooperative checkpointing: a robust approach to large-scale systems reliability, 2006, ICS '06.
[17] Christian Engelmann, et al. Proactive process-level live migration in HPC environments, 2008, HiPC 2008.