Lightweight Measurement and Analysis of HPC Performance Variability

Performance variation arising from hardware and software sources is common in modern scientific and data-intensive computing systems, and synchronization in parallel and distributed programs often exacerbates its impact at scale. The decentralized and emergent effects of such variation are, unfortunately, also difficult to systematically measure, analyze, and predict: modeling assumptions stringent enough to make analysis tractable frequently cannot be guaranteed at meaningful application scales, and longitudinal methods at such scales can require the capture and manipulation of impractically large amounts of data. This paper describes a new, scalable, and statistically robust approach for effective modeling, measurement, and analysis of large-scale performance variation in HPC systems. Our approach avoids the need to reason about complex distributions of runtimes among large numbers of individual application processes by focusing instead on the maximum length of distributed workload intervals. We describe this approach and its MPI-based implementation, which makes it applicable to a diverse set of HPC workloads. We also present evaluations of these techniques for quantifying and predicting performance variation carried out on large-scale computing systems, and discuss the strengths and limitations of the underlying modeling assumptions.
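The abstract's central idea — reasoning about the maximum interval length across synchronized processes rather than the full per-process runtime distribution — can be illustrated with a minimal simulation. This is a sketch under assumed conditions (a bulk-synchronous workload with exponentially distributed noise on each rank), not the paper's actual MPI implementation; all function names and parameters here are illustrative:

```python
import random
import statistics

def simulate_interval_maxima(num_ranks, num_intervals, base=1.0, noise=0.2, seed=0):
    """Simulate a bulk-synchronous workload: each interval ends only when the
    slowest rank finishes, so the effective interval length is the maximum of
    the per-rank times (what an MPI_Reduce with MPI_MAX would observe)."""
    rng = random.Random(seed)
    per_rank_times = []
    maxima = []
    for _ in range(num_intervals):
        # Each rank's time for this interval: fixed work plus random slowdown.
        times = [base + rng.expovariate(1.0 / noise) for _ in range(num_ranks)]
        per_rank_times.extend(times)
        maxima.append(max(times))  # the distributed interval length
    return per_rank_times, maxima

per_rank, maxima = simulate_interval_maxima(num_ranks=256, num_intervals=500)
mean_rank = statistics.mean(per_rank)
mean_max = statistics.mean(maxima)
print(f"mean per-rank interval time:           {mean_rank:.3f}")
print(f"mean synchronized (max) interval time: {mean_max:.3f}")
```

Even modest per-rank noise inflates the synchronized interval length well above the per-rank mean, which is why the maxima, not the per-rank distributions, are the natural object of study. A collection of such interval maxima is also the kind of sample that extreme-value models (e.g., a generalized extreme-value fit) are typically applied to, presumably the role they play in the approach described above.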
