Energy Analysis of Parallel Scientific Kernels on Multiple GPUs

A dramatic improvement in energy efficiency is mandatory for sustainable supercomputing and has been identified as a major challenge. Affordable energy solutions continue to be of great concern in the development of the next generation of supercomputers. Low-power processors, dynamic control of processor frequency, and heterogeneous systems have been proposed to mitigate energy costs. However, the entire software stack must be re-examined with respect to its ability to improve efficiency in terms of energy as well as performance. Addressing this need requires a better understanding of the energy behavior of applications. In this paper we explore the energy efficiency of some common kernels used in high-performance computing on a multi-GPU platform, and compare our results with multicore CPUs. We implement these kernels using optimized libraries such as FFTW, CUBLAS, and MKL. Our experiments demonstrate a relationship between energy consumption and the computation and communication factors of certain application kernels. Overall, we observe a correlation of 0.73 between energy consumption and GPU global memory accesses, and of 0.84 between power consumption and operations per unit time, indicating a strong positive relationship in both cases. We believe that our results will assist the HPC community in understanding the power and energy behavior of scientific kernels on multi-GPU platforms.
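As an illustration of how a correlation figure like those quoted above can be obtained, the following minimal sketch computes a Pearson correlation coefficient between two measured series. The abstract does not state which correlation measure was used, so Pearson's r is an assumption here, and the sample values are hypothetical placeholders, not data from the paper.

```python
import math


def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical measurements, one entry per kernel run (NOT from the paper):
# counted GPU global memory accesses and measured energy in joules.
global_mem_accesses = [1.0e9, 2.5e9, 4.0e9, 6.5e9, 8.0e9]
energy_joules = [120.0, 210.0, 260.0, 480.0, 510.0]

r = pearson(global_mem_accesses, energy_joules)
print(f"correlation(energy, global memory accesses) = {r:.2f}")
```

The same function could be applied to power readings against operations per unit time. A value near 1 indicates that the two quantities tend to rise together, which is the sense in which the abstract describes a strong positive relationship.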
