ARCS: Adaptive Runtime Configuration Selection for Power-Constrained OpenMP Applications

Power is the most critical resource for exascale high-performance computing (HPC). In the future, system administrators may have to monitor the power consumption of a machine under different workloads, so each application may have to run within an allocated power budget. Achieving the best performance on future machines therefore means optimizing performance subject to a power constraint. This additional requirement should not be the responsibility of HPC application developers; optimizing performance for a given power budget should instead be the responsibility of the high-performance system software stack. Modern machines allow power capping of the CPU and memory to implement a power budgeting strategy, and finding the best runtime environment for a node at a given power level is important for getting the best performance. This paper presents the ARCS (Adaptive Runtime Configuration Selection) framework, which automatically selects the best runtime configuration for each OpenMP parallel region at a given power level. The framework uses the OMPT (OpenMP Tools) API, APEX (Autonomic Performance Environment for eXascale), and the Active Harmony framework to explore the configuration search space and select the best number of threads, scheduling policy, and chunk size for a given power level at run time. We test ARCS using the NAS Parallel Benchmarks and the proxy application LULESH on Intel Sandy Bridge and IBM Power multi-core architectures. We show that, for a given power level, efficient OpenMP runtime parameter selection can improve the execution time and energy consumption of an application by up to 40% and 42%, respectively.
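To make the idea of a per-region configuration search concrete, the following is a minimal sketch, not the ARCS implementation. It enumerates candidate (threads, schedule, chunk size) tuples and keeps the one with the lowest measured time. The names `Config`, `search_space`, `select_best`, and `toy_time` are hypothetical, the candidate values are arbitrary, and `toy_time` is an invented cost model standing in for a real timing of a parallel region under a power cap. An exhaustive scan is used here for simplicity; it is not the search strategy of Active Harmony.

```python
from itertools import product
from typing import Callable, List, NamedTuple


class Config(NamedTuple):
    """One candidate OpenMP runtime configuration (hypothetical type)."""
    threads: int
    schedule: str
    chunk: int


def search_space(max_threads: int) -> List[Config]:
    """Build the candidate configurations; the value lists are assumptions."""
    threads = [t for t in (1, 2, 4, 8, 16, 32) if t <= max_threads]
    schedules = ["static", "dynamic", "guided"]
    chunks = [1, 8, 64, 512]
    return [Config(*c) for c in product(threads, schedules, chunks)]


def select_best(configs: List[Config],
                measure: Callable[[Config], float]) -> Config:
    """Exhaustively evaluate every configuration and return the fastest."""
    return min(configs, key=measure)


def toy_time(cfg: Config, power_cap_watts: float = 60.0) -> float:
    """Invented cost model (not measured data) for one parallel region.

    Assumes the power cap sustains roughly one fully clocked thread per
    5 W, so threads beyond that limit add contention without adding speed.
    """
    work = 1000.0
    effective = min(cfg.threads, power_cap_watts / 5.0)
    compute = work / effective
    contention = 0.5 * cfg.threads
    per_chunk = {"static": 0.0, "dynamic": 0.02, "guided": 0.01}[cfg.schedule]
    scheduling = per_chunk * (work / cfg.chunk)
    imbalance = (0.05 if cfg.schedule == "static" else 0.01) * cfg.chunk
    return compute + contention + scheduling + imbalance


if __name__ == "__main__":
    best = select_best(search_space(max_threads=16), toy_time)
    print(best, toy_time(best))
```

In a real setting, `measure` would run and time the region with the candidate settings applied, and the search would be repeated for each power level, since the best configuration may differ from one cap to another.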
