ALBADross: Active Learning Based Anomaly Diagnosis for Production HPC Systems

Diagnosing the causes of performance variations in High-Performance Computing (HPC) systems is a daunting challenge due to the systems' scale and complexity. Variations in application performance result in premature job termination, lower energy efficiency, or wasted computing resources. One potential solution is manual root-cause analysis based on system telemetry data. However, this approach is increasingly time-consuming because it relies on human expertise and the volume of telemetry data is large. Recent research employs supervised machine learning (ML) models to automatically diagnose previously encountered performance anomalies in compute nodes. However, training these models generally requires a large number of labeled samples that represent the anomalous and healthy states of an application. This demand is constraining because gathering labeled samples is difficult and costly, especially for anomalies that occur infrequently. This paper proposes a novel active learning-based framework that diagnoses previously encountered performance anomalies in HPC systems using significantly fewer labeled samples than state-of-the-art ML-based frameworks. Our framework combines an active learning-based query strategy with a supervised classifier to minimize the number of labeled samples required to achieve a target performance score. We evaluate our framework on a production HPC system and a testbed HPC cluster using real and proxy applications. We show that our framework, ALBADross, achieves a 0.95 F1-score using 28x fewer labeled samples than a supervised approach with an equal F1-score, even when the test dataset contains previously unseen applications and application inputs.
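
The idea of pairing a query strategy with a supervised classifier until a target score is reached can be illustrated with a minimal sketch. Everything specific below is an assumption made for illustration and is not taken from the paper: the nearest-centroid classifier, the margin-based uncertainty query, the synthetic two-feature data, and all function names are hypothetical stand-ins, since the abstract does not state which classifier or query strategy ALBADross uses.

```python
import random


def dist2(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


class NearestCentroid:
    """Toy classifier (assumption: a stand-in for the real supervised model)."""

    def fit(self, X, y):
        self.centroids = {}
        for label in set(y):
            rows = [x for x, l in zip(X, y) if l == label]
            self.centroids[label] = [sum(c) / len(rows) for c in zip(*rows)]
        return self

    def predict(self, x):
        return min(self.centroids, key=lambda l: dist2(x, self.centroids[l]))

    def margin(self, x):
        # Gap between the two closest centroids; a small gap means low confidence.
        d = sorted(dist2(x, c) for c in self.centroids.values())
        return d[1] - d[0] if len(d) > 1 else 0.0


def f1_score(y_true, y_pred, positive="anomalous"):
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def make_data(rng, n_per_class):
    """Synthetic samples: healthy near (0, 0), anomalous near (3, 3)."""
    X, y = [], []
    for label, mu in (("healthy", 0.0), ("anomalous", 3.0)):
        for _ in range(n_per_class):
            X.append([rng.gauss(mu, 1.0), rng.gauss(mu, 1.0)])
            y.append(label)
    return X, y


def active_learning(X_pool, oracle, X_test, y_test, seed_idx, target_f1, max_labels):
    """Query the least certain sample until the target F1 or label budget is hit."""
    labeled = {i: oracle(i) for i in seed_idx}  # oracle = human annotator
    while True:
        idx = list(labeled)
        model = NearestCentroid().fit([X_pool[i] for i in idx],
                                      [labeled[i] for i in idx])
        score = f1_score(y_test, [model.predict(x) for x in X_test])
        unlabeled = [i for i in range(len(X_pool)) if i not in labeled]
        if score >= target_f1 or len(labeled) >= max_labels or not unlabeled:
            return model, score, len(labeled)
        query = min(unlabeled, key=lambda i: model.margin(X_pool[i]))
        labeled[query] = oracle(query)


rng = random.Random(0)
X_pool, y_pool = make_data(rng, 100)
X_test, y_test = make_data(rng, 50)
# Seed with one healthy (index 0) and one anomalous (index 100) sample.
model, score, n_labels = active_learning(
    X_pool, lambda i: y_pool[i], X_test, y_test,
    seed_idx=[0, 100], target_f1=0.95, max_labels=40)
print(f"F1={score:.2f} using {n_labels} of {len(X_pool)} pool labels")
```

The loop stops as soon as the held-out F1-score reaches the target, so the number of labels requested from the annotator is the quantity being minimized. A random-query baseline can be obtained by replacing the `min(...)` selection with `rng.choice(unlabeled)` for comparison.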
