Reducing Waste in Extreme Scale Systems through Introspective Analysis

Resilience is a key challenge for extreme-scale supercomputers. Today, failures in supercomputers are commonly assumed to be uniformly distributed in time. However, recent studies show that failures in high-performance computing systems are partially correlated in time, generating periods of higher failure density. Our study of the failure logs of multiple supercomputers shows that during such periods the failure density can reach up to three times the average. We design a monitoring system that listens to hardware events and forwards the important ones to the runtime so that it can detect these regime changes. We implement a runtime capable of receiving such notifications and adapting dynamically. In addition, we build an analytical model to predict the gains that such a dynamic approach could achieve. We demonstrate that on some systems our approach can reduce the wasted time by over 30%.
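The abstract does not spell out the adaptation mechanism, but the connection between failure density and wasted time can be illustrated with the classic first-order approximation for the optimum checkpoint interval (Young, 1974): the optimal interval is roughly sqrt(2 * C * MTBF), where C is the checkpoint cost. A minimal sketch, assuming illustrative numbers (all values and function names below are hypothetical, not taken from the paper), shows why a runtime that detects a high-failure-density regime would want to checkpoint more frequently:

```python
import math

def young_interval(checkpoint_cost: float, mtbf: float) -> float:
    """Young's first-order optimum checkpoint interval: sqrt(2 * C * MTBF)."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)

# Illustrative numbers: 60 s checkpoint cost, 24 h mean time between failures.
C = 60.0
mtbf_normal = 24 * 3600.0
tau_normal = young_interval(C, mtbf_normal)

# In a high-failure-density period, the effective MTBF drops, e.g. to a
# third of the average (matching the "up to three times" density above).
mtbf_dense = mtbf_normal / 3.0
tau_dense = young_interval(C, mtbf_dense)

# An adaptive runtime would shorten its checkpoint interval accordingly;
# a static runtime tuned to the average MTBF over-exposes work to loss.
assert tau_dense < tau_normal
```

Under this model, an interval tuned for the average failure rate is sqrt(3) times too long during a dense period, which is one way such regime-aware adaptation can recover otherwise wasted time.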