Performance and energy analysis of OpenMP runtime systems with dense linear algebra algorithms

In this article, we analyze the performance and energy consumption of five OpenMP runtime systems on a non-uniform memory access (NUMA) platform. We also select three CPU-level optimizations or techniques to evaluate their impact on the runtime systems: the processor features Turbo Boost and C-States, and CPU Dynamic Voltage and Frequency Scaling through Linux CPUFreq governors. We present an experimental study that characterizes OpenMP runtime systems on the three main kernels of dense linear algebra (Cholesky, LU, and QR) in terms of performance and energy consumption. Our experimental results suggest that OpenMP runtime systems can be considered a new energy leverage, and that Turbo Boost and C-States significantly impacted both performance and energy. CPUFreq governors had more impact with Turbo Boost disabled, since both optimizations reduced performance due to CPU thermal limits. An LU factorization with the concurrent-write extension from libKOMP achieved up to 63% performance gain and 29% energy decrease over the original PLASMA algorithm using the GNU C compiler (GCC) libGOMP runtime.
