The Case of Performance Variability on Dragonfly-based Systems

The performance of a parallel code running on a large supercomputer can vary significantly from one run to another, even when the executable and its input parameters are unchanged. Such variability can arise when the computation, the communication, or both are perturbed. In this paper, we investigate performance variability caused by network effects on supercomputers that use a dragonfly topology, specifically Cray XC systems equipped with the Aries interconnect. We perform post-mortem analysis of network hardware counters, profiling output, job queue logs, and placement information, all gathered from periodic runs of representative applications. We identify the causes of performance variability using deviation prediction and recursive feature elimination. Additionally, using time-stepped performance data from individual applications, we train machine learning models that forecast the execution time of future time steps.
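
As a rough illustration of how deviation prediction and recursive feature elimination can work together, the sketch below fits a linear model to the deviation of each run's time from the mean, drops the feature with the smallest absolute weight, and repeats. Everything in it is an assumption made for illustration: the feature names, the synthetic data, and the choice of a gradient-descent linear model are hypothetical and are not taken from the paper.

```python
import random

# Hypothetical feature names; the counters actually used are not given here.
FEATURES = ["stalled_flits", "router_traffic", "num_groups", "neighbor_io"]


def fit_linear(X, y, epochs=500, lr=0.05):
    """Least-squares fit by batch gradient descent (no intercept term)."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(epochs):
        grad = [0.0] * d
        for row, target in zip(X, y):
            err = sum(wi * xi for wi, xi in zip(w, row)) - target
            for j in range(d):
                grad[j] += err * row[j]
        w = [wi - lr * g / n for wi, g in zip(w, grad)]
    return w


def recursive_feature_elimination(X, y, names, keep):
    """Refit repeatedly, each time dropping the feature with the smallest |weight|."""
    active = list(range(len(names)))
    while len(active) > keep:
        sub = [[row[j] for j in active] for row in X]
        w = fit_linear(sub, y)
        weakest = min(range(len(active)), key=lambda k: abs(w[k]))
        del active[weakest]
    return [names[j] for j in active]


# Synthetic runs (assumption): run time depends only on features 0 and 2.
random.seed(0)
X = [[random.gauss(0, 1) for _ in FEATURES] for _ in range(200)]
run_time = [100 + 3.0 * r[0] + 1.5 * r[2] + random.gauss(0, 0.1) for r in X]

# Deviation prediction target: how far each run is from the mean run time.
mean_time = sum(run_time) / len(run_time)
deviation = [t - mean_time for t in run_time]

selected = recursive_feature_elimination(X, deviation, FEATURES, keep=2)
print(selected)
```

On this synthetic data the two surviving features are the ones that actually drive run time. With real counter data the features would need to be standardized first, since weight magnitudes are only comparable across features on a common scale.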
