Performance analysis of multi‐level parallelism: inter‐node, intra‐node and hardware accelerators

The advent of multi‐core processors has made parallel computing techniques mandatory on mainstream systems. With the recent rise of hardware accelerators, hybrid parallelism adds yet another dimension of complexity to software development. The inner workings of a parallel program are usually difficult to understand and verify. This paper presents a tool for graphical program flow analysis of hardware‐accelerated parallel programs. It monitors hybrid program execution, recording and visualizing many performance‐relevant events along the way. Representative real‐world applications written for both IBM's Cell processor and NVIDIA's CUDA API are studied as examples. With our combined monitoring and visualization approach for hardware‐accelerated multi‐core and multi‐node systems, we take the next step in tool evolution towards a substantially improved level of detail, precision, and completeness. The contents of this paper are of interest to developers of hardware‐accelerated applications as well as performance tool architects. Copyright © 2011 John Wiley & Sons, Ltd.
