Reducing Waste in Extreme Scale Systems through Introspective Analysis
Franck Cappello | Saurabh Gupta | Devesh Tiwari | Christian Engelmann | Marc Snir | Ana Gainaru | Swann Perarnau | Leonardo Bautista-Gomez
[1] Franck Cappello,et al. Modeling and tolerating heterogeneous failures in large parallel systems , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[2] Jon Stearley,et al. Bad Words: Finding Faults in Spirit's Syslogs , 2008, 2008 Eighth IEEE International Symposium on Cluster Computing and the Grid (CCGRID).
[3] Christopher D. Carothers,et al. An analysis of clustered failures on large supercomputing systems , 2009, J. Parallel Distributed Comput.
[4] Franck Cappello,et al. FTI: High performance Fault Tolerance Interface for hybrid systems , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[5] Kishor S. Trivedi,et al. Proactive management of software aging , 2001, IBM J. Res. Dev.
[6] Saurabh Gupta,et al. Understanding and Exploiting Spatial Properties of System Failures on Extreme-Scale HPC Systems , 2015, 2015 45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks.
[7] Kenji Yamanishi,et al. Dynamic syslog mining for network failure monitoring , 2005, KDD '05.
[8] Narayan Desai,et al. Co-analysis of RAS Log and Job Log on Blue Gene/P , 2011, 2011 IEEE International Parallel & Distributed Processing Symposium.
[9] Buddy Bland,et al. Titan - Early experience with the Titan system at Oak Ridge National Laboratory , 2012, 2012 SC Companion: High Performance Computing, Networking Storage and Analysis.
[10] Ravishankar K. Iyer,et al. Lessons Learned from the Analysis of System Failures at Petascale: The Case of Blue Waters , 2014, 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks.
[11] David A. Patterson,et al. Path-Based Failure and Evolution Management , 2004, NSDI.
[12] Thanadech Thanakornworakij,et al. Reliability model of a system of k nodes with simultaneous failures for high-performance computing applications , 2013, Int. J. High Perform. Comput. Appl.
[13] Bin Nie,et al. A large-scale study of soft-errors on GPUs in the field , 2016, 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA).
[14] Zhiling Lan,et al. Anomaly localization in large-scale clusters , 2007, 2007 IEEE International Conference on Cluster Computing.
[15] Saurabh Gupta,et al. Lazy Checkpointing: Exploiting Temporal Locality in Failures to Mitigate Checkpointing Overheads on Extreme-Scale Systems , 2014, 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks.
[16] Dror G. Feitelson. The supercomputer industry in light of the Top500 data , 2005, Computing in Science & Engineering.
[17] Christian Engelmann,et al. Blue Gene/L Log Analysis and Time to Interrupt Estimation , 2009, 2009 International Conference on Availability, Reliability and Security.
[18] Felix Salfner,et al. Modeling Event-driven Time Series with Generalized Hidden Semi-Markov Models , 2006 .
[19] John T. Daly,et al. A higher order estimate of the optimum checkpoint interval for restart dumps , 2006, Future Gener. Comput. Syst.
[20] Luís Moura Silva,et al. Deterministic Models of Software Aging and Optimal Rejuvenation Schedules , 2007, 2007 10th IFIP/IEEE International Symposium on Integrated Network Management.
[21] Bianca Schroeder,et al. Understanding failures in petascale computers , 2007 .
[22] Bianca Schroeder,et al. A Large-Scale Study of Failures in High-Performance Computing Systems , 2006, IEEE Transactions on Dependable and Secure Computing.
[23] Charng-Da Lu. Failure Data Analysis of HPC Systems , 2013, ArXiv.
[24] Shekhar Y. Borkar,et al. Designing reliable systems from unreliable components: the challenges of transistor variability and degradation , 2005, IEEE Micro.
[25] John W. Young,et al. A first order approximation to the optimum checkpoint interval , 1974, CACM.
[26] Luigi Carro,et al. Understanding GPU errors on large-scale HPC systems and the implications for system design and operation , 2015, 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA).
[27] John W. Young. A first order approximation to the optimum checkpoint interval , 1974 .
[28] Saurabh Gupta,et al. Reliability lessons learned from GPU experience with the Titan supercomputer at Oak Ridge leadership computing facility , 2015, SC15: International Conference for High Performance Computing, Networking, Storage and Analysis.
[29] Pete Beckman,et al. Argo: An Exascale Operating System and Runtime , 2015 .