Measuring and Understanding Extreme-Scale Application Resilience: A Field Study of 5,000,000 HPC Application Runs
暂无分享,去创建一个
Ravishankar K. Iyer | Zbigniew T. Kalbarczyk | Catello Di Martino | William Kramer | Z. Kalbarczyk | C. Martino | W. Kramer | R. Iyer
[1] Anand Sivasubramaniam,et al. Filtering failure logs for a BlueGene/L prototype , 2005, 2005 International Conference on Dependable Systems and Networks (DSN'05).
[2] Celso L. Mendes,et al. Deploying a Large Petascale System: The Blue Waters Experience , 2014, ICCS.
[3] Franck Cappello,et al. Event Log Mining Tool for Large Scale HPC Systems , 2011, Euro-Par.
[4] Franck Cappello,et al. Fault prediction under the microscope: A closer look into HPC systems , 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.
[5] Woo-Sun Yang,et al. Franklin Job Completion Analysis , 2010 .
[6] Laxmikant V. Kalé,et al. FTC-Charm++: an in-memory checkpoint-based fault tolerant runtime for Charm++ and MPI , 2004, 2004 IEEE International Conference on Cluster Computing (IEEE Cat. No.04EX935).
[7] Franck Cappello,et al. Taming of the Shrew: Modeling the Normal and Faulty Behaviour of Large-scale HPC Systems , 2012, 2012 IEEE 26th International Parallel and Distributed Processing Symposium.
[8] Jon Stearley,et al. What Supercomputers Say: A Study of Five System Logs , 2007, 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN'07).
[9] Ravishankar K. Iyer,et al. Lessons Learned from the Analysis of System Failures at Petascale: The Case of Blue Waters , 2014, 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks.
[10] Xin Chen,et al. Predicting job completion times using system logs in supercomputing clusters , 2013, 2013 43rd Annual IEEE/IFIP Conference on Dependable Systems and Networks Workshop (DSN-W).
[11] Domenico Cotroneo,et al. Improving Log-based Field Failure Data Analysis of multi-node computing systems , 2011, 2011 IEEE/IFIP 41st International Conference on Dependable Systems & Networks (DSN).
[12] Andrew L. Rukhin,et al. Analysis of Time Series Structure SSA and Related Techniques , 2002, Technometrics.
[13] Mark S. Squillante,et al. Failure data analysis of a large-scale heterogeneous server environment , 2004, International Conference on Dependable Systems and Networks, 2004.
[14] Bianca Schroeder,et al. A Large-Scale Study of Failures in High-Performance Computing Systems , 2010, IEEE Trans. Dependable Secur. Comput..
[15] Anand Sivasubramaniam,et al. BlueGene/L Failure Analysis and Prediction Models , 2006, International Conference on Dependable Systems and Networks (DSN'06).
[16] Ravishankar K. Iyer,et al. Characterization of operational failures from a business data processing SaaS platform , 2014, ICSE Companion.
[17] Franck Cappello,et al. Modeling and tolerating heterogeneous failures in large parallel systems , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[18] Robert G. Edwards,et al. The Chroma Software System for Lattice QCD , 2004 .
[19] Catello Di Martino. One Size Does Not Fit All: Clustering Supercomputer Failures Using a Multiple Time Window Approach , 2013, ISC.
[20] Alan Norton,et al. Petascale WRF simulation of hurricane sandy: Deployment of NCSA's cray XE6 blue waters , 2013, 2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[21] Domenico Cotroneo,et al. Assessing time coalescence techniques for the analysis of supercomputer logs , 2012, IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2012).