A Big Data Analytics Framework for HPC Log Data: Three Case Studies Using the Titan Supercomputer Log

Reliability, availability and serviceability (RAS) logs of high performance computing (HPC) resources, when closely investigated in spatial and temporal dimensions, can provide invaluable information regarding system status, performance, and resource utilization. These data are often generated from multiple logging systems and sensors that cover many components of the system. The analysis of these data for finding persistent temporal and spatial insights faces two main difficulties: the volume of RAS logs makes manual inspection difficult and the unstructured nature and unique properties of log data produced by each subsystem adds another dimension of difficulty in identifying implicit correlation among recorded events. To address these issues, we recently developed a multi-user Big Data analytics framework for HPC log data at Oak Ridge National Laboratory (ORNL). This paper introduces three in-progress data analytics projects that leverage this framework to assess system status, mine event patterns, and study correlations between user applications and system events. We describe the motivation of each project and detail their workflows using three years of log data collected from ORNL's Titan supercomputer.

[1]  Anand Sivasubramaniam,et al.  Filtering failure logs for a BlueGene/L prototype , 2005, 2005 International Conference on Dependable Systems and Networks (DSN'05).

[2]  Christian Engelmann,et al.  Failures in Large Scale Systems: Long-term Measurement, Analysis, and Implications , 2017, SC17: International Conference for High Performance Computing, Networking, Storage and Analysis.

[3]  Ravishankar K. Iyer,et al.  LogDiver: A Tool for Measuring Resilience of Extreme-Scale Systems and Applications , 2015, FTXS@HPDC.

[4]  Domenico Cotroneo,et al.  Improving Log-based Field Failure Data Analysis of multi-node computing systems , 2011, 2011 IEEE/IFIP 41st International Conference on Dependable Systems & Networks (DSN).

[5]  Das Amrita,et al.  Mining Association Rules between Sets of Items in Large Databases , 2013 .

[6]  Karl Pearson F.R.S. LIII. On lines and planes of closest fit to systems of points in space , 1901 .

[7]  Prashant Malik,et al.  Cassandra: a decentralized structured storage system , 2010, OPSR.

[8]  Mark S. Squillante,et al.  Failure data analysis of a large-scale heterogeneous server environment , 2004, International Conference on Dependable Systems and Networks, 2004.

[9]  Reynold Xin,et al.  Apache Spark , 2016 .

[10]  Bianca Schroeder,et al.  A Large-Scale Study of Failures in High-Performance Computing Systems , 2010, IEEE Trans. Dependable Secur. Comput..

[11]  Ravishankar K. Iyer,et al.  Lessons Learned from the Analysis of System Failures at Petascale: The Case of Blue Waters , 2014, 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks.

[12]  Zheng Liu,et al.  FLAP: An End-to-End Event Log Analysis Platform for System Management , 2017, KDD.

[13]  Jian Pei,et al.  Mining frequent patterns without candidate generation , 2000, SIGMOD '00.

[14]  Sang Joon Kim,et al.  A Mathematical Theory of Communication , 2006 .

[15]  Jeffrey Dean,et al.  Efficient Estimation of Word Representations in Vector Space , 2013, ICLR.

[16]  Özalp Babaoglu,et al.  A Holistic Approach to Log Data Analysis in High-Performance Computing Systems: The Case of IBM Blue Gene/Q , 2015, Euro-Par Workshops.

[17]  Christian Engelmann,et al.  Big Data Meets HPC Log Analytics: Scalable Approach to Understanding Systems at Extreme Scale , 2017, 2017 IEEE International Conference on Cluster Computing (CLUSTER).

[18]  Narayan Desai,et al.  Co-analysis of RAS Log and Job Log on Blue Gene/P , 2011, 2011 IEEE International Parallel & Distributed Processing Symposium.

[19]  Jon Stearley,et al.  What Supercomputers Say: A Study of Five System Logs , 2007, 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN'07).