High Performance Computing Systems. Performance Modeling, Benchmarking, and Simulation

The Cray XC30 is the first product from a major HPC vendor to use the dragonfly interconnect topology, which naturally raises the question of how well applications perform on such a machine. We consider the performance of an algebraic multigrid solver on an XC30 and develop a performance model for its solve cycle. We use this model both to analyze the solver's performance and to guide data redistribution at runtime, which aims to improve performance by trading messages for increased computation. The performance modeling results demonstrate the ability of the dragonfly interconnect to avoid network contention, but the speedups obtained with the redistribution scheme were enough to raise questions about the ability of the dragonfly topology to handle very communication-intensive
