Case Studies of Multi-core Energy Efficiency in Task Based Programs

In this paper, we present three performance and energy case studies of benchmark applications in the OmpSs environment for task-based programming. Different parallel and vectorized implementations are evaluated on an Intel® Core™ i7-2600 quad-core processor. Using FLOPS/W derived from on-chip model-specific registers (MSRs), we find that AVX code is generally the most energy efficient. The peak on-chip GFLOPS/W rates are 0.89 for Black-Scholes (BS), 1.38 for FFTW, and 1.97 for Matrix Multiply (MM). The experiments cover varying degrees of thread parallelism and different OmpSs task pool scheduling policies. We find that maximum energy efficiency for small and medium-sized problems is obtained by limiting the number of parallel threads. Compared with non-vectorized code, the AVX variants improve on-chip energy efficiency by ≈6–7× (BS) and ≈3–5× (FFTW), depending on the problem size and degree of multithreading.
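As an illustration of how a GFLOPS/W figure can be derived from energy counters, the following minimal sketch assumes the measurements come from Intel RAPL-style registers: a power-unit register whose bits 12:8 encode the energy unit as 1/2^ESU joules, and a 32-bit package energy counter that wraps around. The abstract does not state which registers were read or how they were sampled, so this layout and the function names are illustrative assumptions, not the authors' measurement code.

```python
def energy_unit_joules(power_unit_msr: int) -> float:
    """Decode the energy unit from a RAPL-style power-unit register value.

    Assumption: bits 12:8 hold the energy status unit (ESU), and one
    counter tick equals 1 / 2**ESU joules.
    """
    esu = (power_unit_msr >> 8) & 0x1F
    return 1.0 / (1 << esu)


def energy_joules(before: int, after: int, unit_j: float) -> float:
    """Energy consumed between two samples of a 32-bit energy counter.

    The mask makes the difference correct across a single wraparound.
    """
    ticks = (after - before) & 0xFFFFFFFF
    return ticks * unit_j


def gflops_per_watt(flops: float, joules: float) -> float:
    """GFLOPS/W equals GFLOP per joule: elapsed time cancels out."""
    return flops / joules / 1e9


# Hypothetical sample values, chosen only to exercise the functions.
unit = energy_unit_joules(0x000A1003)            # ESU = 16
used = energy_joules(0xFFFFFF00, 0x00000100, unit)  # counter wrapped once
print(unit, used, gflops_per_watt(2e9, 1.0))
```

Because FLOPS/W is floating-point operations per second divided by joules per second, the metric reduces to operations per joule, so only the operation count and the two counter samples are needed.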
