An Execution Fingerprint Dictionary for HPC Application Recognition

Applications running on HPC systems waste time and energy if they (a) use resources inefficiently, (b) deviate from the purpose of their allocation (e.g., cryptocurrency mining), or (c) encounter errors and failures. It is therefore important to know which applications are running on the system, how they use the system, and whether they have been executed before. To recognize known applications during execution on a noisy system, we draw inspiration from the way Shazam recognizes known songs playing in a crowded bar. Our contribution is an Execution Fingerprint Dictionary (EFD) that stores key-value pairs for application recognition: the keys are execution fingerprints of system metrics, and the values are the corresponding application and input size information. Related work often relies on extensive system monitoring (many system metrics collected over large time windows) and employs machine learning methods to identify applications. Our solution uses only the first 2 minutes of execution and a single system metric to achieve F-scores above 95 percent, providing results comparable to related work with a fraction of the data and a straightforward recognition mechanism.
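The key-value lookup described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the fingerprint construction (mean of one metric over fixed windows, rounded to a coarse step so that noise maps to the same key), the 1-second sampling rate, the class and function names, and the majority-vote matching are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def fingerprints(samples, window=60, step=100):
    """Turn one metric's samples into (window_index, rounded_mean) keys.

    Assumption: one sample per second, so 120 samples cover 2 minutes.
    Rounding the mean to a multiple of `step` absorbs small noise.
    """
    keys = []
    for start in range(0, len(samples) - window + 1, window):
        chunk = samples[start:start + window]
        mean = sum(chunk) / window
        keys.append((start // window, round(mean / step) * step))
    return keys


class ExecutionFingerprintDictionary:
    """Hypothetical dictionary: fingerprint key -> {(application, input size)}."""

    def __init__(self):
        self._table = defaultdict(set)

    def learn(self, samples, app, input_size):
        # Store every fingerprint of a labeled execution.
        for key in fingerprints(samples):
            self._table[key].add((app, input_size))

    def recognize(self, samples):
        # Each matching fingerprint votes for the labels stored under it.
        votes = Counter()
        for key in fingerprints(samples):
            votes.update(self._table.get(key, ()))
        return votes.most_common(1)[0][0] if votes else None


efd = ExecutionFingerprintDictionary()
efd.learn([1000] * 120, "ft", "X")   # made-up metric values and labels
efd.learn([5000] * 120, "mg", "Y")

noisy = [1010, 990] * 60             # same level as "ft", with noise
print(efd.recognize(noisy))          # ('ft', 'X')
print(efd.recognize([9000] * 120))   # None: no stored fingerprint matches
```

The rounding step trades precision for robustness: a larger `step` tolerates more noise but makes distinct applications more likely to collide on the same key.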
