Analyzing a Five-Year Failure Record of a Leadership-Class Supercomputer
Esteban Meneses | Don Maxwell | Elvis Rojas | Terry Jones
[1] Bianca Schroeder, et al. A Large-Scale Study of Failures in High-Performance Computing Systems, 2010, IEEE Trans. Dependable Secur. Comput.
[2] Bin Nie, et al. A large-scale study of soft-errors on GPUs in the field, 2016, 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA).
[3] Franck Cappello, et al. Toward Exascale Resilience: 2014 update, 2014, Supercomput. Front. Innov.
[4] Saurabh Gupta, et al. Reliability lessons learned from GPU experience with the Titan supercomputer at Oak Ridge leadership computing facility, 2015, SC15: International Conference for High Performance Computing, Networking, Storage and Analysis.
[5] Anand Sivasubramaniam, et al. Filtering failure logs for a BlueGene/L prototype, 2005, 2005 International Conference on Dependable Systems and Networks (DSN'05).
[6] Christian Engelmann, et al. Failures in Large Scale Systems: Long-term Measurement, Analysis, and Implications, 2017, SC17: International Conference for High Performance Computing, Networking, Storage and Analysis.
[7] Zuoning Chen, et al. A Large-Scale Study of Failures on Petascale Supercomputers, 2018, Journal of Computer Science and Technology.
[8] Ricardo Bianchini, et al. System Resilience at Extreme Scale White Paper, 2009.
[9] Jon Stearley, et al. What Supercomputers Say: A Study of Five System Logs, 2007, 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN'07).
[10] Matt Ezell, et al. Understanding the Impact of Interconnect Failures on System Operation, 2013.
[11] Christian Engelmann, et al. Analyzing the Impact of System Reliability Events on Applications in the Titan Supercomputer, 2018, 2018 IEEE/ACM 8th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS).
[12] Scott Atchley, et al. GPU Age-Aware Scheduling to Improve the Reliability of Leadership Jobs on Titan, 2018, SC18: International Conference for High Performance Computing, Networking, Storage and Analysis.
[13] Ravishankar K. Iyer, et al. Lessons Learned from the Analysis of System Failures at Petascale: The Case of Blue Waters, 2014, 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks.
[14] Esteban Meneses, et al. Analyzing the Interplay of Failures and Workload on a Leadership-Class Supercomputer, 2015.
[15] Franck Cappello, et al. Addressing failures in exascale computing, 2014, Int. J. High Perform. Comput. Appl.
[16] Bianca Schroeder, et al. Reading between the lines of failure logs: Understanding how HPC systems fail, 2013, 2013 43rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN).
[17] Bianca Schroeder, et al. Understanding failures in petascale computers, 2007.
[18] Luigi Carro, et al. Understanding GPU errors on large-scale HPC systems and the implications for system design and operation, 2015, 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA).
[19] Narayan Desai, et al. Co-analysis of RAS Log and Job Log on Blue Gene/P, 2011, 2011 IEEE International Parallel & Distributed Processing Symposium.
[20] Guangwen Yang, et al. Job failures in high performance computing systems: A large-scale empirical study, 2012, Comput. Math. Appl.