Asymptotically Optimal Load Balancing for Hierarchical Multi-Core Systems

Current multi-core machines feature a complex, hierarchical core topology, multiple cache levels, and a memory subsystem with a NUMA design. Although this design gives parallel machines high processing power, it comes at the cost of asymmetric memory access latencies. Depending on the communication patterns of the parallel application, this asymmetry may reduce the overall performance of the system. To achieve scalable performance in this environment, it is therefore crucial to exploit the machine architecture while taking the application's communication patterns into account. In this paper, we introduce a topology-aware load balancing algorithm named HWTOPOLB. It combines the characteristics of the machine topology with the communication patterns of the application to equalize the application load across the available cores while reducing communication latencies. We also present a proof that the algorithm is asymptotically optimal (Theorem 1). We have implemented our load balancing algorithm in the CHARM++ parallel system and analyzed its performance using three different benchmarks. Our experimental results show that HWTOPOLB can achieve average performance improvements of 24% over existing load balancing strategies on three different multi-core machines.
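The idea of combining load and topology-weighted communication cost can be illustrated with a small sketch. This is not the HWTOPOLB algorithm itself, which the abstract does not specify; it is a generic greedy heuristic written for illustration. The function name `topology_aware_greedy`, the weight `alpha`, and the distance matrix `dist` (standing in for relative access latencies between cores) are all assumptions introduced here.

```python
def topology_aware_greedy(loads, comm, dist, alpha=1.0):
    """Greedy placement sketch (hypothetical, not the paper's algorithm).

    loads[t]   : computational load of task t
    comm[t][u] : communication volume between tasks t and u
    dist[c][d] : assumed relative latency between cores c and d
    alpha      : assumed weight of communication cost versus load
    """
    n_cores = len(dist)
    core_load = [0.0] * n_cores
    placement = {}
    # Place the heaviest tasks first, as in classic greedy list scheduling.
    for task in sorted(range(len(loads)), key=lambda t: -loads[t]):
        best_core, best_cost = None, float("inf")
        for core in range(n_cores):
            # Cost of talking to already-placed tasks from this core.
            comm_cost = sum(
                comm[task][other] * dist[core][placement[other]]
                for other in placement
            )
            cost = core_load[core] + loads[task] + alpha * comm_cost
            if cost < best_cost:
                best_core, best_cost = core, cost
        placement[task] = best_core
        core_load[best_core] += loads[task]
    return placement, core_load


# Toy machine: two NUMA nodes with two cores each (cores 0-1 and 2-3).
dist = [
    [0, 1, 3, 3],
    [1, 0, 3, 3],
    [3, 3, 0, 1],
    [3, 3, 1, 0],
]
loads = [2, 2, 2, 2]
# Tasks 0 and 1 communicate, as do tasks 2 and 3.
comm = [
    [0, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 0],
]
placement, core_load = topology_aware_greedy(loads, comm, dist)
print(placement, core_load)
```

In this toy input, each pair of communicating tasks ends up on cores of the same NUMA node while the load stays evenly spread, which is the trade-off the abstract describes. A real balancer would obtain loads and communication volumes from the runtime and the topology from the hardware rather than from hand-written matrices.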
