Job Classification Through Long-Term Log Analysis Towards Power-Aware HPC System Operation

Achieving high utilization of HPC system resources under constraints on electric power consumption or I/O workload is one of the primary goals in meeting high demand from application users. CPU and memory utilization, which is closely tied to electric power consumption, is the counterpart metric to I/O activity in most HPC jobs. To raise utilization of HPC systems while electric power consumption and I/O activity are restricted, we must avoid hot-spots in power consumption or I/O operations, because such hot-spots lead to unstable system operation when they exceed the capacity of the electric power supply or the I/O subsystem. Analysis of a large volume of log data collected from the K computer has revealed a high correlation between I/O activity and CPU and memory utilization in some specific compute node layouts, showing distinct characteristics of HPC jobs, such as being computation-intensive or I/O-intensive. Classifying jobs by required electric power turned out to divide them into two groups: jobs that consume high electric power and I/O-intensive jobs. Using a machine learning approach, we classified jobs with high correctness, and we have confirmed the effectiveness of the classification for power-aware system operation in our next HPC system, the supercomputer Fugaku.
