Process Placement in Multicore Clusters: Algorithmic Issues and Practical Techniques

Current generations of clusters are built from NUMA nodes featuring multicore or manycore processors. Programming such architectures efficiently is challenging because many hardware characteristics have to be taken into account, especially the memory hierarchy. One appealing way to improve the performance of parallel applications is to decrease their communication costs by matching their communication pattern to the underlying hardware architecture. In this paper, we detail the algorithm and techniques we propose to achieve this: first, we gather both the communication pattern of the application and the details of the hardware topology. Then, we compute a relevant reordering of the application's process ranks. Finally, these new ranks are used to reduce the application's communication costs, as illustrated by the sketch following this abstract.
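To make the last step concrete, the C/MPI sketch below shows one common way of applying a precomputed rank permutation: a new communicator is created with MPI_Comm_split so that process ranks follow the permutation. The permutation sigma is hypothetical here (an identity placeholder) and stands in for the output of the matching step described above; this is a minimal illustration, not the paper's implementation.

/* Minimal sketch (C + MPI), assuming a permutation sigma[] has already been
 * computed from the gathered communication matrix and the hardware topology:
 * sigma[old_rank] is the rank that old_rank should take so that heavily
 * communicating processes end up on nearby cores. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int old_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &old_rank);

    /* Hypothetical permutation: identity used as a placeholder for the
     * result of the matching step (e.g., a TreeMatch-like computation). */
    int new_rank = old_rank; /* sigma[old_rank] in a real run */

    /* MPI_Comm_split with a single color and 'new_rank' as the key yields a
     * communicator whose ranks are ordered according to the permutation. */
    MPI_Comm reordered;
    MPI_Comm_split(MPI_COMM_WORLD, 0, new_rank, &reordered);

    int r;
    MPI_Comm_rank(reordered, &r);
    printf("world rank %d -> reordered rank %d\n", old_rank, r);

    /* The application then performs its communication on 'reordered'. */
    MPI_Comm_free(&reordered);
    MPI_Finalize();
    return 0;
}

An application would substitute its own permutation for the identity placeholder and route all subsequent communication through the reordered communicator.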
