KPIs-Based Clustering and Visualization of HPC Jobs: A Feature Reduction Approach

High-Performance Computing (HPC) systems must be monitored constantly to ensure their stability. Monitoring systems collect a tremendous amount of data about different parameters, or Key Performance Indicators (KPIs), such as resource usage and I/O waiting time. Proper analysis of this data, usually stored as time series, can provide insight into choosing the right management strategies and can support the early detection of issues. In this paper, we introduce a methodology to cluster HPC jobs according to their KPIs. Our approach reduces the inherent high dimensionality of the collected data by applying two techniques to the time series: literature-based and variance-based feature extraction. We also define a procedure to visualize the obtained clusters by combining these two approaches with Principal Component Analysis (PCA). Finally, we validate our contributions on a real data set. The results show that the KPIs related to CPU usage provide the best cohesion and separation for clustering analysis, and that our visualization methodology yields good results.
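As a rough illustration of the general idea (not the implementation used in the paper), the following minimal sketch reduces each job's KPI time series to a few global statistics, keeps the features with the largest variance across jobs, and clusters the jobs with a basic k-means. The feature set, the toy CPU-usage values, the function names, and the deterministic centroid initialization are all assumptions made for this example; the PCA projection used for visualization is omitted.

```python
import statistics

def extract_features(series):
    """Summarize one KPI time series with global statistics (illustrative choice)."""
    return [statistics.fmean(series), statistics.pstdev(series),
            min(series), max(series)]

def select_by_variance(rows, keep):
    """Keep the `keep` feature columns with the largest variance across jobs."""
    n_cols = len(rows[0])
    variances = [statistics.pvariance([r[c] for r in rows]) for c in range(n_cols)]
    top = sorted(range(n_cols), key=lambda c: variances[c], reverse=True)[:keep]
    top.sort()
    return [[r[c] for c in top] for r in rows], top

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iterations=20):
    """Basic k-means; centroids start at the first k points (a simplification)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(k), key=lambda j: squared_distance(p, centroids[j]))
                  for p in points]
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [statistics.fmean(col) for col in zip(*members)]
    return labels

# Hypothetical CPU-usage traces (percent) for six jobs: three light, three heavy.
jobs = [
    [10, 12, 11, 13], [9, 11, 10, 12], [14, 12, 13, 15],
    [88, 90, 91, 89], [92, 94, 93, 95], [85, 87, 86, 88],
]

features = [extract_features(series) for series in jobs]
reduced, kept = select_by_variance(features, keep=2)
labels = kmeans(reduced, k=2)
print("kept feature columns:", kept)
print("cluster labels:", labels)
```

In this toy data the standard deviation is nearly identical for every job, so the variance-based step discards it and the clustering relies on the features that actually distinguish light jobs from heavy ones. A production pipeline would more likely use established libraries for feature extraction, PCA, and clustering, and would evaluate the result with internal validation indices.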
