Making Sense of Performance Counter Measurements on Supercomputing Applications

The computation nodes of modern supercomputers consist of multiple multicore chips. Many scientific and engineering application codes have been migrated to these systems with little or no optimization for multicore architectures, so they effectively use only a fraction of the cores on each chip or achieve suboptimal performance from the cores they do use. Performance optimization on these systems requires both different measurements and different optimization techniques than those for single-core chips. This paper describes the primary performance bottlenecks unique to multicore chips and sketches the roles that several commonly used measurement tools can most effectively play in performance optimization. The HOMME benchmark code from NCAR is used as a representative case study on several multicore-based supercomputers to formulate and interpret measurements and to derive characterizations relevant to modern multicore performance bottlenecks. Finally, we describe common pitfalls in performance measurement on multicore chips and how they may be avoided, along with a novel high-level multicore optimization technique that increased performance by up to 35%.
