A Memory Congestion-Aware MPI Process Placement for Modern NUMA Systems

MPI process placement is an important step toward scalable performance on modern non-uniform memory access (NUMA) systems. A recent study of NUMA architectures has shown that, on modern NUMA systems, memory congestion can degrade performance more severely than poor data locality, because heavy congestion on the memory controllers leads to long access latencies. Conventional work on MPI process placement, however, has focused on locality, i.e., minimizing remote memory accesses and inter-node communication. Moreover, maximizing locality may actually degrade performance because it can increase the load imbalance among the nodes of a modern NUMA system. A process placement algorithm therefore needs to take memory congestion into account. This paper proposes a method that reconciles locality and memory congestion on modern NUMA systems. The method statically analyzes the application's communication pattern to optimize its process placement. A data clustering method is applied to the time-series data of the MPI communications in order to identify the traffic that is likely to cause memory congestion. The proposed method has been evaluated with the NPB kernels on a real NUMA system and in a simulation environment. Experimental results show that the proposed method can achieve a 1.6x performance improvement compared with the current state-of-the-art strategy.
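To illustrate the clustering step described above, the following is a minimal sketch, not the authors' implementation: it assumes an MPI trace has already been reduced to a matrix of per-rank-pair traffic volumes over time intervals (the `traffic` array, the two-cluster split, and the total-volume/burstiness features are all illustrative assumptions), and it uses k-means to separate heavy flows that could congest memory controllers from ordinary traffic.

```python
# Illustrative sketch (assumed preprocessing, not the paper's exact pipeline):
# cluster per-rank-pair MPI traffic time series with k-means and flag the
# heavy cluster as a potential source of memory congestion.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Hypothetical input: 64 rank pairs observed over 100 time intervals,
# each entry holding bytes transferred in that interval.
traffic = rng.gamma(shape=2.0, scale=1e5, size=(64, 100))

# Simple per-series features: total volume and burstiness (peak / mean).
totals = traffic.sum(axis=1)
burstiness = traffic.max(axis=1) / np.maximum(traffic.mean(axis=1), 1.0)
features = np.column_stack([totals, burstiness])

# Normalize so neither feature dominates the Euclidean distance.
features = (features - features.mean(axis=0)) / features.std(axis=0)

# Two clusters: congestion-prone (heavy) traffic vs. ordinary traffic.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(features)

# The cluster with the larger mean total volume is treated as "heavy".
heavy_cluster = int(np.argmax([totals[km.labels_ == c].mean() for c in range(2)]))
heavy_pairs = np.flatnonzero(km.labels_ == heavy_cluster)
print("rank pairs flagged as potential congestion sources:", heavy_pairs)
```

In a placement tool, the flagged pairs could then be distributed across memory controllers rather than packed for locality alone; how exactly the placement uses this information is specific to the proposed method and is not reproduced here.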
