An experimental approach to performance measurement of heterogeneous parallel applications using CUDA

Heterogeneous parallel systems that use GPU devices for application acceleration have garnered significant attention in the supercomputing community. However, to realize the full potential of GPU computing, application developers need tools to measure and analyze accelerator performance in the context of the parallel execution as a whole. A performance measurement technology for the NVIDIA CUDA platform has been developed and integrated with the TAU parallel performance system. The design of the TAUcuda package is based on an experimental NVIDIA CUDA driver and its associated runtime and device libraries. In any environment where the experimental CUDA driver is installed, TAUcuda can provide detailed performance information about the execution of GPU kernels and their interactions with the parallel program, without any modification to the program source or executable code. The paper describes the TAUcuda technology and how it is combined with the TAU measurement framework to provide integrated performance views. Various examples of TAUcuda use are presented, including CUDA SDK examples, a GPU version of the Linpack benchmark, and the scalable molecular dynamics application NAMD.
