Lessons Learned from the Analysis of System Failures at Petascale: The Case of Blue Waters
暂无分享,去创建一个
Ravishankar K. Iyer | Zbigniew T. Kalbarczyk | Catello Di Martino | William Kramer | Joseph Fullop | Fabio Baccanico | Z. Kalbarczyk | C. Martino | W. Kramer | R. Iyer | Fabio Baccanico | Joseph Fullop
[1] Jon Stearley,et al. What Supercomputers Say: A Study of Five System Logs , 2007, 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN'07).
[2] Anand Sivasubramaniam,et al. BlueGene/L Failure Analysis and Prediction Models , 2006, International Conference on Dependable Systems and Networks (DSN'06).
[3] Domenico Cotroneo,et al. Assessing time coalescence techniques for the analysis of supercomputer logs , 2012, IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2012).
[4] Eduardo Pinheiro,et al. DRAM errors in the wild: a large-scale field study , 2009, SIGMETRICS '09.
[5] Sudhanva Gurumurthi,et al. Feng Shui of supercomputer memory positional effects in DRAM and SRAM faults , 2013, 2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[6] Franck Cappello,et al. Taming of the Shrew: Modeling the Normal and Faulty Behaviour of Large-scale HPC Systems , 2012, 2012 IEEE 26th International Parallel and Distributed Processing Symposium.
[7] Timothy J. Dell,et al. A white paper on the benefits of chipkill-correct ecc for pc server main memory , 1997 .
[8] Xin Chen,et al. Predicting job completion times using system logs in supercomputing clusters , 2013, 2013 43rd Annual IEEE/IFIP Conference on Dependable Systems and Networks Workshop (DSN-W).
[9] Domenico Cotroneo,et al. Improving Log-based Field Failure Data Analysis of multi-node computing systems , 2011, 2011 IEEE/IFIP 41st International Conference on Dependable Systems & Networks (DSN).
[10] Anand Sivasubramaniam,et al. Filtering failure logs for a BlueGene/L prototype , 2005, 2005 International Conference on Dependable Systems and Networks (DSN'05).
[11] Franck Cappello,et al. Fault prediction under the microscope: A closer look into HPC systems , 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.
[12] Jon Stearley,et al. A State-Machine Approach to Disambiguating Supercomputer Event Logs , 2012, MAD.
[13] Catello Di Martino. One Size Does Not Fit All: Clustering Supercomputer Failures Using a Multiple Time Window Approach , 2013, ISC.
[14] Bianca Schroeder,et al. A Large-Scale Study of Failures in High-Performance Computing Systems , 2010, IEEE Trans. Dependable Secur. Comput..
[15] Bianca Schroeder,et al. Disk Failures in the Real World: What Does an MTTF of 1, 000, 000 Hours Mean to You? , 2007, FAST.
[16] Franck Cappello,et al. Modeling and tolerating heterogeneous failures in large parallel systems , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[17] Mark S. Squillante,et al. Failure data analysis of a large-scale heterogeneous server environment , 2004, International Conference on Dependable Systems and Networks, 2004.
[18] Michael R. Lyu,et al. Handbook of software reliability engineering , 1996 .
[19] David A. Patterson,et al. Studying and using failure data from large-scale internet services , 2002, EW 10.