Optimizing Process-to-Core Mappings for Application Level Multi-dimensional MPI Communications

Multi-dimensional MPI communications, where MPI communications must be performed along each dimension of a Cartesian communicator, are frequently used in many of today's high-performance computing applications. While individual MPI collective communications for regular communicators with a one-dimensional linear ranking of processes have been extensively studied and optimized, little optimization has been performed for multi-dimensional MPI collective communications on multi-dimensional Cartesian topologies. In this paper, we optimize multi-dimensional MPI collective communications for SMP and multi-core systems at the application level. We show that the default Cartesian topologies built by state-of-the-art MPI implementations produce sub-optimal performance for multi-dimensional MPI collective communications. We design optimal process-to-core mapping schemes for Cartesian communicators that minimize the total inter-node communication. The proposed technique improves performance by up to 80% over the default Cartesian topology built by Cray's MPI implementation MPT 3.1.02 on Jaguar at Oak Ridge National Laboratory, currently the world's second fastest supercomputer.
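The following is a minimal sketch, not the paper's mapping scheme, of the kind of application-level pattern the abstract describes: a 2-D Cartesian communicator is created with the standard MPI topology routines, and a collective is issued along each dimension through `MPI_Cart_sub` sub-communicators. The grid shape, the `reorder` flag, and the row-reduce/column-broadcast sequence are illustrative assumptions; which row and column neighbors end up on the same node, and hence how much traffic crosses the network, depends on the process-to-core mapping that the paper optimizes.

```c
/* Sketch (assumed example, not the paper's method): per-dimension
 * collectives on a 2-D Cartesian communicator. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int nprocs, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Let MPI factor the process count into a 2-D grid. */
    int dims[2] = {0, 0}, periods[2] = {0, 0};
    MPI_Dims_create(nprocs, 2, dims);

    /* Default Cartesian topology; reorder=1 lets the library remap ranks. */
    MPI_Comm cart;
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &cart);

    /* One sub-communicator per dimension: rows vary along dim 1,
     * columns vary along dim 0. */
    MPI_Comm row_comm, col_comm;
    int keep_row[2] = {0, 1}, keep_col[2] = {1, 0};
    MPI_Cart_sub(cart, keep_row, &row_comm);
    MPI_Cart_sub(cart, keep_col, &col_comm);

    /* Multi-dimensional collective: reduce along each row,
     * then broadcast the row results down each column. */
    double local = (double)rank, row_sum = 0.0;
    MPI_Allreduce(&local, &row_sum, 1, MPI_DOUBLE, MPI_SUM, row_comm);
    MPI_Bcast(&row_sum, 1, MPI_DOUBLE, 0, col_comm);

    if (rank == 0)
        printf("grid %d x %d, row sum at rank 0 = %f\n", dims[0], dims[1], row_sum);

    MPI_Comm_free(&row_comm);
    MPI_Comm_free(&col_comm);
    MPI_Comm_free(&cart);
    MPI_Finalize();
    return 0;
}
```

In this sketch the per-dimension sub-communicators are where the mapping matters: if an entire row can be placed within one multi-core node, the row-wise `MPI_Allreduce` stays in shared memory and only the column-wise step crosses the interconnect.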
