Doomsday: Predicting Which Node Will Fail When on Supercomputers

Predicting which node will fail, and how soon, remains a challenge for HPC resilience, yet such predictions may pave the way for proactive remedies before jobs fail. Distilling anomalous events from noisy raw logs requires substantial effort, not only when scaling up to exascale systems but even on contemporary supercomputer architectures. To this end, we propose TBP (time-based phrases), a novel phrase extraction mechanism that pinpoints node failures. Our study, based on real system data and statistical machine learning, demonstrates the feasibility of predicting which specific node will fail in Cray systems. TBP achieves recall rates of no less than 83% with lead times of up to 2 minutes. This opens the door to longer prediction lead times for supercomputing systems in general, facilitating more efficient use of both computing capacity and power in large-scale production systems.
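
The abstract does not include implementation details, but the core idea of time-based phrases can be illustrated. Below is a minimal Python sketch, not the paper's actual algorithm: it groups each node's log-message templates into sliding time windows, emits short template sequences ("phrases"), and scores each phrase by how often it precedes a node failure within a lead-time horizon. The function names (`extract_time_based_phrases`, `score_phrases`), the 30-second window, the 3-template phrase length, and the 2-minute horizon are all assumptions for illustration; turning raw messages into template IDs is assumed to happen upstream (e.g., with a LogMine-style template miner).

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Each syslog entry is assumed to be (timestamp, node_id, template_id),
# where template_id comes from an upstream log-template miner.

def extract_time_based_phrases(entries, window=timedelta(seconds=30), max_len=3):
    """For each event, emit the trailing sequence (up to max_len) of
    templates seen on that node within the preceding time window."""
    by_node = defaultdict(list)
    for ts, node, tpl in sorted(entries):
        by_node[node].append((ts, tpl))

    phrases = defaultdict(list)  # node -> list of (window_start, phrase)
    for node, events in by_node.items():
        start = 0
        for end in range(len(events)):
            # Slide the window start forward until it fits the time bound.
            while events[end][0] - events[start][0] > window:
                start += 1
            seq = tuple(tpl for _, tpl in events[start:end + 1])[-max_len:]
            phrases[node].append((events[start][0], seq))
    return phrases

def score_phrases(phrases, failures, lead=timedelta(minutes=2)):
    """Score each phrase: fraction of its occurrences that precede a
    failure of the same node within the lead-time horizon."""
    precede, total = defaultdict(int), defaultdict(int)
    for node, seqs in phrases.items():
        fail_times = failures.get(node, [])
        for t0, seq in seqs:
            total[seq] += 1
            if any(t0 <= ft <= t0 + lead for ft in fail_times):
                precede[seq] += 1
    return {seq: precede[seq] / total[seq] for seq in total}

# Hypothetical usage with made-up Cray-style node names:
entries = [
    (datetime(2018, 5, 1, 12, 0, 0), "c0-0c0s1n2", "MCE_DETECTED"),
    (datetime(2018, 5, 1, 12, 0, 10), "c0-0c0s1n2", "LINK_RETRAIN"),
]
failures = {"c0-0c0s1n2": [datetime(2018, 5, 1, 12, 1, 30)]}
ranked = score_phrases(extract_time_based_phrases(entries), failures)
```

In a real deployment, the scored phrases would feed a statistical learner rather than being thresholded directly; this sketch only shows how time ordering and per-node grouping turn noisy raw logs into failure-correlated features.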
