Multi-Level Load Balancing with an Integrated Runtime Approach

The recent trend of increasing numbers of cores per chip has resulted in vast amounts of on-node parallelism. These high core counts also bring hardware variability that introduces imbalance. Applications are becoming more complex as well, resulting in dynamic load imbalance. Load imbalance of any kind can cause loss of performance and reduced system utilization. We address the challenge of handling both transient and persistent load imbalances while maintaining locality with low overhead. In this paper, we propose an integrated runtime system that combines the Charm++ distributed programming model with concurrent tasks to mitigate load imbalances within and across shared-memory address spaces. It uses a periodic assignment of work to cores based on load measurement, in combination with user-created tasks, to handle load imbalance. We integrate OpenMP with Charm++ to enable the creation of potential tasks via OpenMP's parallel loop construct; this capability is also available to MPI applications through the Adaptive MPI implementation. We demonstrate the benefits of our work on three applications. We show improvements for Lassen of 29.6% on Cori and 46.5% on Theta. We also demonstrate benefits for a Charm++ application, ChaNGa, of 25.7% on Theta, as well as for an MPI proxy application, Kripke, using Adaptive MPI.
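
As an illustration of the usage pattern the abstract describes (this sketch is not taken from the paper's code), the snippet below shows a chare-like object whose entry method exposes its loop iterations as potential tasks through OpenMP's parallel loop construct. In the integrated runtime, idle cores on the same node can pick up chunks of such a loop to absorb transient imbalance, while the periodic measurement-based load balancer migrates whole objects to correct persistent imbalance. The class, method names, and per-cell work are hypothetical, and the Charm++ inheritance is elided so the sketch compiles as plain C++ with OpenMP.

```cpp
// Minimal sketch (hypothetical names), assuming the Charm++/OpenMP
// integration described in the abstract.
#include <omp.h>
#include <cstddef>
#include <vector>

class Block /* : public CBase_Block in an actual Charm++ program */ {
  std::vector<double> cells;
public:
  explicit Block(std::size_t n) : cells(n, 1.0) {}

  // Entry method invoked each timestep. Its loop body is the unit of
  // potential parallelism: with the integrated runtime, chunks of these
  // iterations can be executed by otherwise idle cores on the node.
  void compute(double dt) {
    #pragma omp parallel for schedule(dynamic)
    for (std::size_t i = 0; i < cells.size(); ++i)
      cells[i] += dt * 0.5 * cells[i];   // placeholder per-cell work
  }
};

int main() {
  Block b(1 << 20);                      // hypothetical problem size
  for (int step = 0; step < 10; ++step)
    b.compute(1e-3);
  return 0;
}
```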
