Communication and topology-aware load balancing in Charm++ with TreeMatch

Programming multicore or manycore architectures is a hard challenge particularly if one wants to fully take advantage of their computing power. Moreover, a hierarchical topology implies that communication performance is heterogeneous and this characteristic should also be exploited. We developed two load balancers for Charm++ that take into account both aspects, depending on the fact that the application is compute-bound or communication-bound. This work is based on our TREEMATCH library that computes process placement in order to reduce an application communication costs based on the hardware topology. We show that the proposed load-balancing schemes manage to improve the execution times for the two aforementioned classes of parallel applications.

[1]  Laxmikant V. Kale,et al.  Charm++ and AMPI: Adaptive Runtime Strategies via Migratable Objects , 2009 .

[2]  Hubert Ritzdorf,et al.  The scalable process topology interface of MPI 2.2 , 2011, Concurr. Comput. Pract. Exp..

[3]  Emmanuel Jeannot,et al.  Near-Optimal Placement of MPI Processes on Hierarchical NUMA Architectures , 2010, Euro-Par.

[4]  Emmanuel Jeannot,et al.  Improving MPI Applications Performance on Multicore Clusters with Rank Reordering , 2011, EuroMPI.

[5]  Guillaume Mercier,et al.  hwloc: A Generic Framework for Managing Hardware Affinities in HPC Applications , 2010, 2010 18th Euromicro Conference on Parallel, Distributed and Network-based Processing.

[6]  Laxmikant V. Kalé,et al.  Benefits of Topology Aware Mapping for Mesh Interconnects , 2008, Parallel Process. Lett..

[7]  Laxmikant V. Kalé,et al.  CHARM++: a portable concurrent object oriented system based on C++ , 1993, OOPSLA '93.

[8]  Silvio Micali,et al.  An O(v|v| c |E|) algoithm for finding maximum matching in general graphs , 1980, 21st Annual Symposium on Foundations of Computer Science (sfcs 1980).

[9]  F. Pellegrini,et al.  Static mapping by dual recursive bipartitioning of process architecture graphs , 1994, Proceedings of IEEE Scalable High Performance Computing Conference.

[10]  Laxmikant V. Kalé,et al.  Massively parallel cosmological simulations with ChaNGa , 2008, 2008 IEEE International Symposium on Parallel and Distributed Processing.

[11]  Bradley C. Kuszmaul,et al.  Cilk: an efficient multithreaded runtime system , 1995, PPOPP '95.

[12]  Torsten Hoefler,et al.  Generic topology mapping strategies for large-scale parallel architectures , 2011, ICS '11.

[13]  Wenguang Chen,et al.  MPIPP: an automatic profile-guided parallel process placement toolset for SMP clusters and multiclusters , 2006, ICS '06.

[14]  Philippe Olivier Alexandre Navaux,et al.  Asymptotically Optimal Load Balancing for Hierarchical Multi-Core Systems , 2012, 2012 IEEE 18th International Conference on Parallel and Distributed Systems.

[15]  Laxmikant V. Kalé,et al.  NAMD: a Parallel, Object-Oriented Molecular Dynamics Program , 1996, Int. J. High Perform. Comput. Appl..

[16]  Jean Roman,et al.  SCOTCH: A Software Package for Static Mapping by Dual Recursive Bipartitioning of Process and Architecture Graphs , 1996, HPCN Europe.

[17]  Thomas Rauber,et al.  Mapping Algorithms for Multiprocessor Tasks on Multi-Core Clusters , 2008, 2008 37th International Conference on Parallel Processing.

[18]  Laxmikant V. Kalé,et al.  A Hierarchical Approach for Load Balancing on Parallel Multi-core Systems , 2012, 2012 41st International Conference on Parallel Processing.

[19]  B. Brandfass,et al.  Rank reordering for MPI communication optimization , 2013 .