Methodologies for advance warning of compute cluster problems via statistical analysis: a case study
Jackson Mayo | Ann Gentile | Jim Brandt | Matthew Wong | David Thompson | Philippe Pébay | Diana Roe
[1] Anand Sivasubramaniam, et al. BlueGene/L Failure Analysis and Prediction Models, 2006, International Conference on Dependable Systems and Networks (DSN'06).
[2] S. Scott, et al. Reliability Analysis in HPC clusters, 2006.
[3] Cheng-Zhong Xu, et al. Exploring event correlation for failure prediction in coalitions of clusters, 2007, Proceedings of the 2007 ACM/IEEE Conference on Supercomputing (SC '07).
[4] Ann C. Gentile, et al. Meaningful Automated Statistical Analysis of Large Computational Clusters, 2005, 2005 IEEE International Conference on Cluster Computing.
[5] Anand Sivasubramaniam, et al. Fault-aware job scheduling for BlueGene/L systems, 2004, 18th International Parallel and Distributed Processing Symposium, 2004. Proceedings.
[6] Bianca Schroeder, et al. A Large-Scale Study of Failures in High-Performance Computing Systems, 2010, IEEE Trans. Dependable Secur. Comput.
[7] Bianca Schroeder, et al. Understanding failures in petascale computers, 2007.
[8] Jon Stearley, et al. Bad Words: Finding Faults in Spirit's Syslogs, 2008, 2008 Eighth IEEE International Symposium on Cluster Computing and the Grid (CCGRID).
[9] Bert J. Debusschere, et al. Ovis-2: A robust distributed architecture for scalable RAS, 2008, 2008 IEEE International Symposium on Parallel and Distributed Processing.
[10] Bert J. Debusschere, et al. Using Probabilistic Characterization to Reduce Runtime Faults in HPC Systems, 2008, 2008 Eighth IEEE International Symposium on Cluster Computing and the Grid (CCGRID).
[11] Ron A. Oldfield. Lightweight storage and overlay networks for fault tolerance, 2006.