Every day, supercomputers execute thousands of jobs with different characteristics. Data centers monitor the behavior of jobs to support users and improve the infrastructure, for instance, by optimizing jobs or by determining guidelines for the next procurement. Classifying jobs into groups that exhibit similar run-time behavior aids this analysis because it reduces the number of representative jobs that must be inspected. It is state of the practice to investigate job similarity by looking into job profiles that summarize the dynamics of job execution into one dimension of statistics and neglect the temporal behavior. In this work, we utilize machine learning techniques to cluster and classify parallel jobs based on the similarity of their temporal IO behavior, to highlight the importance of temporal behavior when comparing jobs. Our contribution is the qualitative and quantitative evaluation of different IO characterizations and similarity measurements that work toward the development of a suitable clustering algorithm. We explore IO characteristics from monitoring data of one million parallel jobs and cluster them into groups of similar jobs. To this end, the time series of various IO statistics are converted into features using different similarity metrics that customize the classification. We discuss conventional ML techniques that are applied to job profiles and contrast this with the analysis of time series data, where we apply the Levenshtein distance as a distance metric. While the employed Levenshtein algorithms are not yet optimal, the results suggest that temporal behavior is key to identifying related patterns.
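To make the idea of comparing jobs by temporal IO behavior concrete, the following is a minimal sketch, not the implementation evaluated in this work. It assumes each job is described by a time series of one IO statistic, quantizes every sample into a small number of categories, and compares the resulting sequences with the classic Levenshtein (edit) distance. The function names, the thresholds, and the sample values are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def quantize(series: Sequence[float],
             thresholds: Sequence[float] = (1.0, 100.0)) -> List[int]:
    """Map each sample to a category: the number of thresholds it reaches.

    The thresholds are an assumption of this sketch (e.g. 0 = idle,
    1 = moderate IO, 2 = heavy IO), not values taken from the paper.
    """
    return [sum(value >= t for t in thresholds) for value in series]


def levenshtein(a: Sequence[int], b: Sequence[int]) -> int:
    """Classic edit distance: insertions, deletions, substitutions cost 1."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: Sequence[int], b: Sequence[int]) -> float:
    """Normalize the distance to [0, 1]; 1.0 means identical sequences."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


# Hypothetical per-interval IO volume for three jobs (arbitrary units).
job_a = [0, 0, 250, 300, 5, 0]      # one IO burst in the middle
job_b = [0, 180, 220, 4, 0, 0]      # same burst, shifted by one interval
job_c = [50, 60, 55, 40, 45, 50]    # steady moderate IO

qa, qb, qc = quantize(job_a), quantize(job_b), quantize(job_c)
print("a vs b:", similarity(qa, qb))
print("a vs c:", similarity(qa, qc))
```

In this toy data, the two bursty jobs come out as more similar to each other than to the steady job, even though the burst is shifted in time. A profile that aggregates each job into a single statistic would not capture that the first two jobs share the same temporal pattern. The resulting pairwise similarities could serve as input to a clustering step.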