Optimizing Process-to-Core Mappings for Application Level Multi-dimensional MPI Communications

Multi-dimensional MPI communications, where MPI communications must be performed along each dimension of a Cartesian communicator, are frequently used in many of today's high-performance computing applications. While individual MPI collective communications for regular communicators with a one-dimensional linear ranking of processes have been extensively studied and optimized, little optimization has been performed for multi-dimensional MPI collective communications on multi-dimensional Cartesian topologies. In this paper, we optimize multi-dimensional MPI collective communications for SMP and multi-core systems at the application level. We show that the default Cartesian topologies built by state-of-the-art MPI implementations produce sub-optimal performance for multi-dimensional MPI collective communications. We design optimal process-to-core mapping schemes for Cartesian communicators that minimize the total inter-node communication. The proposed technique improves performance by up to 80% over the default Cartesian topology built by Cray's MPI implementation MPT 3.1.02 on Jaguar at Oak Ridge National Laboratory, currently the world's second fastest supercomputer.
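The following is a minimal sketch, not the paper's mapping scheme, of the kind of application-level pattern the abstract describes: a 2-D Cartesian communicator is created with the standard MPI topology routines, and a collective is issued along each dimension through `MPI_Cart_sub` sub-communicators. The grid shape, the `reorder` flag, and the row-reduce/column-broadcast sequence are illustrative assumptions; which row and column neighbors end up on the same node, and hence how much traffic crosses the network, depends on the process-to-core mapping that the paper optimizes.

```c
/* Sketch (assumed example, not the paper's method): per-dimension
 * collectives on a 2-D Cartesian communicator. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int nprocs, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Let MPI factor the process count into a 2-D grid. */
    int dims[2] = {0, 0}, periods[2] = {0, 0};
    MPI_Dims_create(nprocs, 2, dims);

    /* Default Cartesian topology; reorder=1 lets the library remap ranks. */
    MPI_Comm cart;
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &cart);

    /* One sub-communicator per dimension: rows vary along dim 1,
     * columns vary along dim 0. */
    MPI_Comm row_comm, col_comm;
    int keep_row[2] = {0, 1}, keep_col[2] = {1, 0};
    MPI_Cart_sub(cart, keep_row, &row_comm);
    MPI_Cart_sub(cart, keep_col, &col_comm);

    /* Multi-dimensional collective: reduce along each row,
     * then broadcast the row results down each column. */
    double local = (double)rank, row_sum = 0.0;
    MPI_Allreduce(&local, &row_sum, 1, MPI_DOUBLE, MPI_SUM, row_comm);
    MPI_Bcast(&row_sum, 1, MPI_DOUBLE, 0, col_comm);

    if (rank == 0)
        printf("grid %d x %d, row sum at rank 0 = %f\n", dims[0], dims[1], row_sum);

    MPI_Comm_free(&row_comm);
    MPI_Comm_free(&col_comm);
    MPI_Comm_free(&cart);
    MPI_Finalize();
    return 0;
}
```

In this sketch the per-dimension sub-communicators are where the mapping matters: if an entire row can be placed within one multi-core node, the row-wise `MPI_Allreduce` stays in shared memory and only the column-wise step crosses the interconnect.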
