Understanding Performance Interference in Next-Generation HPC Systems

Next-generation HPC systems face a wide range of new potential sources of application interference, including resilience actions, system software adaptation, and in situ analytics programs. In this paper, we present a new model, based on extreme value theory, for analyzing the performance of bulk-synchronous HPC applications. After validating this model against both synthetic and real applications, we use simulation and modeling techniques to profile next-generation interference sources and characterize their behavior and performance impact on a selection of HPC benchmarks, mini-applications, and applications. Lastly, we show how the model can be used to understand how current interference mitigation techniques in multi-processors work.
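
To illustrate why extreme value theory is a natural fit for bulk-synchronous applications, the following is a minimal sketch and not the model presented in the paper. It assumes each process suffers an independent, exponentially distributed interference delay per iteration, so every iteration is delayed by the maximum over all processes. For exponential delays that maximum is approximately Gumbel distributed, with mean close to mean_noise * (ln P + gamma). The function names, the exponential noise distribution, and all parameter values are assumptions chosen for illustration.

```python
import math
import random

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def simulate_bsp_mean_delay(num_procs, num_iters, mean_noise, seed=0):
    """Average per-iteration delay of a bulk-synchronous run.

    Each iteration ends at a barrier, so its delay is the largest
    interference delay seen by any process (assumed exponential here).
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_iters):
        total += max(rng.expovariate(1.0 / mean_noise) for _ in range(num_procs))
    return total / num_iters


def gumbel_mean_delay(num_procs, mean_noise):
    """Extreme-value (Gumbel) approximation of the expected maximum
    of num_procs exponential delays with the given mean."""
    return mean_noise * (math.log(num_procs) + EULER_GAMMA)


if __name__ == "__main__":
    for procs in (16, 256, 4096):
        simulated = simulate_bsp_mean_delay(procs, num_iters=500, mean_noise=1.0)
        predicted = gumbel_mean_delay(procs, mean_noise=1.0)
        print(f"P={procs:5d}  simulated={simulated:.3f}  predicted={predicted:.3f}")
```

Under these assumptions the per-iteration delay grows with the logarithm of the process count even though the mean delay on any single process stays fixed, which is the kind of scaling behavior an extreme-value model is meant to capture.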
