Anomaly Detection in High Performance Computers: A Vicinity Perspective

In response to the demand for higher computational power, the number of computing nodes in high performance computers (HPC) increases rapidly. Exascale HPC systems are expected to arrive by 2020. With drastic increase in the number of HPC system components, it is expected to observe a sudden increase in the number of failures which, consequently, poses a threat to the continuous operation of the HPC systems. Detecting failures as early as possible and, ideally, predicting them, is a necessary step to avoid interruptions in HPC systems operation. Anomaly detection is a well-known general purpose approach for failure detection, in computing systems. The majority of existing methods are designed for specific architectures, require adjustments on the computing systems hardware and software, need excessive information, or pose a threat to users' and systems' privacy. This work proposes a node failure detection mechanism based on a vicinity-based statistical anomaly detection approach using passively collected and anonymized system log entries. Application of the proposed approach on system logs collected over 8 months indicates an anomaly detection precision between 62% to 81%.

[1]  Christian Engelmann,et al.  Failures in Large Scale Systems: Long-term Measurement, Analysis, and Implications , 2017, SC17: International Conference for High Performance Computing, Networking, Storage and Analysis.

[2]  R. Service,et al.  China's planned exascale computer threatens Summit's position at the top. , 2018, Science.

[3]  Dhabaleswar K. Panda,et al.  Monitoring and Predicting Hardware Failures in HPC Clusters with FTB-IPMI , 2012, 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum.

[4]  Noah Treuhaft,et al.  Recovery Oriented Computing (ROC): Motivation, Definition, Techniques, , 2002 .

[5]  Peter Filzmoser,et al.  Dynamic log file analysis: An unsupervised cluster evolution approach for anomaly detection , 2018, Comput. Secur..

[6]  Wolfgang E. Nagel,et al.  Lessons Learned from Spatial and Temporal Correlation of Node Failures in High Performance Computers , 2016, 2016 24th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (PDP).

[7]  Anthony A. Maciejewski,et al.  Optimizing checkpoint intervals for reduced energy use in exascale systems , 2017, 2017 Eighth International Green and Sustainable Computing Conference (IGSC).

[8]  Risto Vaarandi,et al.  LogCluster - A data clustering and pattern mining algorithm for event logs , 2015, 2015 11th International Conference on Network and Service Management (CNSM).

[9]  Feifei Li,et al.  DeepLog: Anomaly Detection and Diagnosis from System Logs through Deep Learning , 2017, CCS.

[10]  Christian Engelmann,et al.  Big Data Meets HPC Log Analytics: Scalable Approach to Understanding Systems at Extreme Scale , 2017, 2017 IEEE International Conference on Cluster Computing (CLUSTER).

[11]  Akio Watanabe,et al.  Proactive failure detection learning generation patterns of large-scale network logs , 2015, 2015 11th International Conference on Network and Service Management (CNSM).

[12]  Jannis Klinkenberg,et al.  Data Mining-Based Analysis of HPC Center Operations , 2017, 2017 IEEE International Conference on Cluster Computing (CLUSTER).

[13]  Bianca Schroeder,et al.  Reading between the lines of failure logs: Understanding how HPC systems fail , 2013, 2013 43rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN).

[14]  Franck Cappello,et al.  Toward Exascale Resilience: 2014 update , 2014, Supercomput. Front. Innov..

[15]  Andy B. Yoo,et al.  Approved for Public Release; Further Dissemination Unlimited X-ray Pulse Compression Using Strained Crystals X-ray Pulse Compression Using Strained Crystals , 2002 .

[16]  Florina M. Ciorba,et al.  Assessing Data Usefulness for Failure Analysis in Anonymized System Logs , 2018, 2018 17th International Symposium on Parallel and Distributed Computing (ISPDC).

[17]  Al Geist,et al.  A survey of high-performance computing scaling challenges , 2017, Int. J. High Perform. Comput. Appl..

[18]  Risto Vaarandi,et al.  An unsupervised framework for detecting anomalous messages from syslog log files , 2018, NOMS 2018 - 2018 IEEE/IFIP Network Operations and Management Symposium.

[19]  Alexander Aiken,et al.  Alert Detection in System Logs , 2008, 2008 Eighth IEEE International Conference on Data Mining.

[20]  Christian Engelmann,et al.  A Big Data Analytics Framework for HPC Log Data: Three Case Studies Using the Titan Supercomputer Log , 2018, 2018 IEEE International Conference on Cluster Computing (CLUSTER).

[21]  Nithin Nakka,et al.  Predicting Node Failure in High Performance Computing Systems from Failure and Usage Logs , 2011, 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum.

[22]  Frank Mueller,et al.  Desh: deep learning for system health prediction of lead times to failure in HPC , 2018, HPDC.

[23]  Florina M. Ciorba,et al.  Anonymization of System Logs for Preserving Privacy and Reducing Storage , 2018 .

[24]  Vipin Kumar,et al.  Anomaly Detection for Discrete Sequences: A Survey , 2012, IEEE Transactions on Knowledge and Data Engineering.

[25]  Franck Cappello,et al.  LOGAIDER: A Tool for Mining Potential Correlations of HPC Log Events , 2017, 2017 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID).

[26]  Ke Zhang,et al.  2016 Ieee International Conference on Big Data (big Data) Automated It System Failure Prediction: a Deep Learning Approach , 2022 .