Paper Information - Design and Characterization of InfiniBand Hardware Tag Matching in MPI

The Message Passing Interface (MPI) standard uses the triple (source rank, tag, communicator ID) to place incoming data into the correct application receive buffer. Searching the receive queues for the appropriate match is called Tag Matching (TM). In state-of-the-art MPI libraries, this operation is performed either by the main thread or by a separate communication progress thread. Both approaches underutilize resources and incur major synchronization overheads, which results in suboptimal performance. The Mellanox ConnectX-5 network architecture introduced a feature that offloads Tag Matching and communication progress from the host to the InfiniBand network card. This paper proposes a Hardware Tag Matching-aware MPI library and discusses the design aspects and challenges of leveraging this feature in an MPI library. It also characterizes Hardware Tag Matching using different benchmarks and provides guidelines that help application developers write Hardware Tag Matching-aware applications and get the most out of this feature. The proposed designs improve the performance of non-blocking collectives by up to 42% on 512 nodes, and improve the performance of a 3D stencil application kernel on 7168 processes and of Nekbone on 512 processes by 40% and 3.5%, respectively.
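To make the matching step described in the abstract concrete, the following is a minimal sketch of host-side (software) Tag Matching. It is not the paper's design or any MPI library's implementation. The class and function names (`TagMatcher`, `PostedRecv`, `Message`, `matches`) are hypothetical, and the two-queue layout (posted receives plus unexpected messages) and the wildcard constants are assumptions based on how MPI matching is commonly described.

```python
from collections import deque
from dataclasses import dataclass

# Hypothetical stand-ins for MPI_ANY_SOURCE and MPI_ANY_TAG.
ANY_SOURCE = -1
ANY_TAG = -1


@dataclass
class PostedRecv:
    source: int    # expected sender rank, or ANY_SOURCE
    tag: int       # expected tag, or ANY_TAG
    comm_id: int   # communicator id (no wildcard allowed)
    buffer: list   # stands in for the application receive buffer


@dataclass
class Message:
    source: int
    tag: int
    comm_id: int
    payload: bytes


def matches(recv: PostedRecv, msg: Message) -> bool:
    """Compare the (source rank, tag, communicator id) triple, honoring wildcards."""
    return (recv.comm_id == msg.comm_id
            and recv.source in (ANY_SOURCE, msg.source)
            and recv.tag in (ANY_TAG, msg.tag))


class TagMatcher:
    """Assumed two-queue model: posted receives and unexpected messages."""

    def __init__(self):
        self.posted = deque()      # receives waiting for a message
        self.unexpected = deque()  # messages that arrived before their receive

    def post_recv(self, recv: PostedRecv) -> bool:
        # Search already-arrived messages first, oldest first.
        for i, msg in enumerate(self.unexpected):
            if matches(recv, msg):
                del self.unexpected[i]
                recv.buffer.append(msg.payload)
                return True
        self.posted.append(recv)
        return False

    def on_arrival(self, msg: Message) -> bool:
        # Linear search of posted receives, oldest first.
        for i, recv in enumerate(self.posted):
            if matches(recv, msg):
                del self.posted[i]
                recv.buffer.append(msg.payload)
                return True
        self.unexpected.append(msg)
        return False
```

In this sketch, both searches run on a host thread and scan the queues linearly. This is the work that the abstract describes as being offloaded to the network card.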
