HPC Workload Characterization Using Feature Selection and Clustering

Large high-performance computing (HPC) systems are expensive resources that support thousands of scientific applications. However, it is not easy to determine the configuration that lets a given workload make the best use of the storage and I/O systems. Users typically rely on the default configurations provided by system administrators, which often results in poor performance. To identify the application characteristics most important to I/O performance, we apply several machine learning techniques to characterize these applications. To find the features most relevant to I/O performance, we evaluate a number of feature selection methods, e.g., mutual information regression and F regression, and develop a novel feature selection method based on min-max mutual information. These feature selection methods allow us to sift through a large set of real-world workloads collected from NERSC's Cori supercomputer and identify the most important features. We employ several clustering algorithms, including KMeans, Gaussian Mixture Model (GMM), and Ward linkage, and measure cluster quality with the Davies-Bouldin Index (DBI), the Silhouette score, and a new Combined Score developed for this work. The cluster evaluation shows that the test dataset is best divided into three clusters: cluster 1 contains mostly small jobs with operations on standard I/O units, cluster 2 consists of medium-sized parallel jobs dominated by read operations, and cluster 3 includes large parallel jobs with heavy write operations. The cluster characteristics suggest that using the parallel I/O library MPI-IO and a large number of parallel cores is important for achieving high I/O throughput.
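
The general shape of such a pipeline can be sketched with scikit-learn. This is an illustration under stated assumptions, not the method used in this work: the data are synthetic stand-ins for job logs (the Cori workloads are not reproduced here), the helper names `make_synthetic_jobs`, `rank_features`, and `evaluate_clusterings` are hypothetical, and neither the min-max mutual information method nor the Combined Score is implemented because the abstract does not define them. Only the standard mutual information regression, F regression, KMeans, GMM, Ward linkage, Silhouette, and Davies-Bouldin components are shown.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.feature_selection import f_regression, mutual_info_regression
from sklearn.metrics import davies_bouldin_score, silhouette_score
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler


def make_synthetic_jobs(n_per_group=100, seed=0):
    """Hypothetical stand-in for job logs: 3 groups, 2 informative + 3 noise features."""
    rng = np.random.default_rng(seed)
    centers = np.array([[0.0, 0.0], [6.0, 0.0], [0.0, 6.0]])
    informative = np.vstack(
        [rng.normal(c, 0.5, size=(n_per_group, 2)) for c in centers]
    )
    noise = rng.normal(0.0, 1.0, size=(informative.shape[0], 3))
    X = np.hstack([informative, noise])
    # Synthetic "I/O throughput" target that depends only on the informative features.
    y = 2.0 * informative[:, 0] + informative[:, 1]
    y = y + rng.normal(0.0, 0.1, size=y.shape[0])
    return X, y


def rank_features(X, y, seed=0):
    """Score each feature against the target with two standard filter methods."""
    mi = mutual_info_regression(X, y, random_state=seed)
    f_stat, _p_values = f_regression(X, y)
    return mi, f_stat


def evaluate_clusterings(X, k_values=(2, 3, 4, 5), seed=0):
    """Return {(algorithm, k): (silhouette, davies_bouldin)} for each setting."""
    results = {}
    for k in k_values:
        models = {
            "kmeans": KMeans(n_clusters=k, n_init=10, random_state=seed),
            "gmm": GaussianMixture(n_components=k, random_state=seed),
            "ward": AgglomerativeClustering(n_clusters=k, linkage="ward"),
        }
        for name, model in models.items():
            labels = model.fit_predict(X)
            results[(name, k)] = (
                silhouette_score(X, labels),      # higher is better
                davies_bouldin_score(X, labels),  # lower is better
            )
    return results


X, y = make_synthetic_jobs()
mi, f_stat = rank_features(X, y)

# Keep the two features with the highest mutual information, then standardize.
top = np.argsort(mi)[::-1][:2]
X_sel = StandardScaler().fit_transform(X[:, sorted(top)])

results = evaluate_clusterings(X_sel)
best = max(results, key=lambda key: results[key][0])  # best by Silhouette
print("selected feature indices:", sorted(top.tolist()))
print("best (algorithm, k) by Silhouette:", best)
```

Selecting the best setting by Silhouette alone is a simplification made for this sketch; combining it with the Davies-Bouldin Index, as the Combined Score presumably does, would require a definition that the abstract does not give.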
