APEX Introspection APEX Policy Engine APEX State Application OpenMP Runtime Synchronous Asynchronous Triggered

Power is the most critical resource for the exascale high performance computing. In the future, system administrators might have to pay attention to the power consumption of the machine under different work loads. Hence, each application may have to run with an allocated power budget. Thus, achieving the best performance on future machines requires optimal performance subject to a power constraint. This additional performance requirement should not be the responsibility of HPC (High Performance Computing) application developers. Optimizing the performance for a given power budget should be the responsibility of high-performance system software stack. Modern machines allow power capping of CPU and memory to implement power budgeting strategy. Finding the best runtime environment for a node at a given power level is important to get the best performance. This paper presents ARCS (Adaptive Runtime Configuration Selection) framework that automatically selects the best runtime configuration for each OpenMP parallel region at a given power level. The framework uses OMPT (OpenMP Tools) API, APEX (Autonomic Performance Environment for eXascale), and Active Harmony frameworks to explore configuration search space and selects the best number of threads, scheduling policy, and chunk size for a given power level at run-time. We test ARCS using the NAS Parallel Benchmark, and proxy application LULESH with Intel Sandybridge, and IBM Power multi-core architectures. We show that for a given power level, efficient OpenMP runtime parameter selection can improve the execution time and energy consumption of an application up to 40% and 42% respectively.

[1]  Hartmut Kaiser,et al.  HPX: A Task Based Programming Model in a Global Address Space , 2014, PGAS.

[2]  Allen D. Malony,et al.  An Autonomic Performance Environment for Exascale , 2015, Supercomput. Front. Innov..

[3]  Thomas L. Sterling,et al.  ParalleX An Advanced Parallel Execution Model for Scaling-Impaired Applications , 2009, 2009 International Conference on Parallel Processing Workshops.

[4]  Dong Li,et al.  Hybrid MPI/OpenMP power-aware computing , 2010, 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS).

[5]  Alvin M. Despain,et al.  Cache designs for energy efficiency , 1995, Proceedings of the Twenty-Eighth Annual Hawaii International Conference on System Sciences.

[6]  Robert J. Fowler,et al.  An early prototype of an autonomic performance environment for exascale , 2013, ROSS '13.

[7]  Thomas L. Sterling,et al.  A Dynamic Execution Model Applied to Distributed Collision Detection , 2014, ISC.

[8]  Martin Schulz,et al.  Adaptive Configuration Selection for Power-Constrained Heterogeneous Systems , 2014, 2014 43rd International Conference on Parallel Processing.

[9]  I-Hsin Chung,et al.  Active Harmony: Towards Automated Performance Tuning , 2002, ACM/IEEE SC 2002 Conference (SC'02).

[10]  David A. Wood,et al.  Cache Power Budgeting for Performance , 2013 .

[11]  Yale N. Patt,et al.  Feedback-driven threading: power-efficient and high-performance execution of multi-threaded workloads on CMPs , 2008, ASPLOS.

[12]  Indrani Paul,et al.  Understanding idle behavior and power gating mechanisms in the context of modern benchmarks on CPU-GPU Integrated systems , 2015, 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA).

[13]  John Cavazos,et al.  Using Per-Loop CPU Clock Modulation for Energy Efficiency in OpenMP Applications , 2015, 2015 44th International Conference on Parallel Processing.

[14]  Margaret H. Wright,et al.  The opportunities and challenges of exascale computing , 2010 .

[15]  Ian Karlin,et al.  LULESH 2.0 Updates and Changes , 2013 .

[16]  Martin Schulz,et al.  Exploring hardware overprovisioning in power-constrained, high performance computing , 2013, ICS '13.

[17]  Dong Li,et al.  Strategies for Energy-Efficient Resource Management of Hybrid Programming Models , 2013, IEEE Transactions on Parallel and Distributed Systems.

[18]  Michael F. P. O'Boyle,et al.  Mapping parallelism to multi-cores: a machine learning based approach , 2009, PPoPP '09.

[19]  Martin Schulz,et al.  A Run-Time System for Power-Constrained HPC Applications , 2015, ISC.

[20]  Allen D. Malony,et al.  Integrated Measurement for Cross-Platform OpenMP Performance Analysis , 2014, IWOMP.

[21]  Dimitrios S. Nikolopoulos,et al.  Online strategies for high-performance power-aware thread execution on emerging multiprocessors , 2006, Proceedings 20th IEEE International Parallel & Distributed Processing Symposium.

[22]  Martin Schulz,et al.  Beyond DVFS: A First Look at Performance under a Hardware-Enforced Power Bound , 2012, 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum.

[23]  Robert Dietrich,et al.  OMPT: An OpenMP Tools Application Programming Interface for Performance Analysis , 2013, IWOMP.

[24]  J. M. Bull,et al.  Measuring Synchronisation and Scheduling Overheads in OpenMP , 2007 .

[25]  Martin Schulz,et al.  Finding the limits of power-constrained application performance , 2015, SC15: International Conference for High Performance Computing, Networking, Storage and Analysis.

[26]  Allen D. Malony,et al.  The Tau Parallel Performance System , 2006, Int. J. High Perform. Comput. Appl..