Energy and performance characteristics of different parallel implementations of scientific applications on multicore systems

Energy consumption is a major concern with high-performance multicore systems. In this paper, we explore the energy consumption and performance (execution time) characteristics of different parallel implementations of scientific applications. In particular, the experiments focus on message-passing interface (MPI)-only versus hybrid MPI/OpenMP implementations for hybrid the NAS (NASA Advanced Supercomputing) BT (Block Tridiagonal) benchmark (strong scaling), a Lattice Boltzmann application (strong scaling), and a Gyrokinetic Toroidal Code — GTC (weak scaling), as well as central processing unit (CPU) frequency scaling. Experiments were conducted on a system instrumented to obtain power information; this system consists of eight nodes with four cores per node. The results indicate, with respect to the MPI-only versus the hybrid implementation, that the best implementation is dependent upon the application executed on 16 or fewer cores. For the case of 32 cores, the results were consistent in that hybrid implementation resulted in less execution time and energy. With CPU frequency scaling, the best case for energy saving was not the best case for execution time.

[1]  David K. Lowenthal,et al.  Using multiple energy gears in MPI programs on a power-scalable cluster , 2005, PPoPP.

[2]  David K. Lowenthal,et al.  Just In Time Dynamic Voltage Scaling: Exploiting Inter-Node Slack to Save Energy in MPI Programs , 2005 .

[3]  Shuaiwen Song,et al.  Energy Profiling and Analysis of the HPC Challenge Benchmarks , 2009, Int. J. High Perform. Comput. Appl..

[4]  Xingfu Wu,et al.  Prophesy: an infrastructure for performance analysis and modeling of parallel and grid applications , 2003, PERV.

[5]  David K. Lowenthal,et al.  Just In Time Dynamic Voltage Scaling: Exploiting Inter-Node Slack to Save Energy in MPI Programs , 2005, ACM/IEEE SC 2005 Conference (SC'05).

[6]  Xingfu Wu,et al.  Performance Analysis, Modeling and Prediction of a Parallel Multiblock Lattice Boltzmann Application Using Prophesy System , 2006, 2006 IEEE International Conference on Cluster Computing.

[7]  Bronis R. de Supinski,et al.  Adagio: making DVS practical for complex HPC applications , 2009, ICS.

[8]  Phillip A. Laplante Performance Analysis and Optimization , 2004 .

[9]  Xingfu Wu,et al.  Performance characteristics of hybrid MPI/OpenMP implementations of NAS parallel benchmarks SP and BT on large-scale multicore supercomputers , 2011, PERV.

[10]  Dong Li,et al.  PowerPack: Energy Profiling and Analysis of High-Performance Systems and Applications , 2010, IEEE Transactions on Parallel and Distributed Systems.

[11]  Wu-chun Feng,et al.  A Power-Aware Run-Time System for High-Performance Computing , 2005, ACM/IEEE SC 2005 Conference (SC'05).

[12]  Shuaiwen Song,et al.  Iso-Energy-Efficiency: An Approach to Power-Constrained Parallel Computation , 2011, 2011 IEEE International Parallel & Distributed Processing Symposium.

[13]  Dong Li,et al.  Hybrid MPI/OpenMP power-aware computing , 2010, 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS).

[14]  Xingfu Wu,et al.  Performance Analysis and Optimization of Parallel Scientific Applications on CMP Clusters , 2009, Scalable Comput. Pract. Exp..

[15]  Dimitrios S. Nikolopoulos,et al.  Prediction-Based Power-Performance Adaptation of Multithreaded Scientific Codes , 2008, IEEE Transactions on Parallel and Distributed Systems.