Protocols for Fully Offloaded Collective Operations on Accelerated Network Adapters

With each successive generation, network adapters for high-performance networks become more powerful and feature-rich. Modern high-performance NICs can execute complex group communication operations without any host-CPU involvement, and several offloading interfaces have been designed for collective communications with the goal of offloading arbitrary communication patterns completely. In this work, we analyze the offloading model of the Portals 4 specification in detail. We perform a theoretical analysis based on abstract communication graphs and present several protocols for implementing offloaded communication schedules. Based on this analysis, we propose and implement an extension to the Portals 4 specification that enables offloading any communication pattern entirely to the NIC. Our measurements with several advanced communication algorithms confirm that these enhancements provide good overlap and asynchronous progress in practical settings. Altogether, we demonstrate a complete and simple scheme for implementing arbitrary offloaded communication algorithms in hardware. Our protocols can serve as a blueprint for the development of communication hardware and middleware that optimizes the whole communication stack.
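To make the offloading model concrete, the following is a minimal software sketch of the counter-and-trigger mechanism that NIC offload interfaces such as Portals 4 triggered operations expose: every operation in a communication schedule is pre-posted with a counter threshold, and the NIC fires it autonomously once enough events have arrived. All names here (`Counter`, `Nic`, `triggered_put`) are illustrative stand-ins, not the real Portals 4 API; the example pre-posts a complete binomial-tree broadcast schedule that then runs with no further "CPU" involvement.

```python
# Illustrative model of NIC-side triggered operations (NOT the Portals 4 API).

class Counter:
    """Event counter; pre-posted actions fire when their threshold is reached."""
    def __init__(self):
        self.value = 0
        self.pending = []          # list of (threshold, action)

    def increment(self):
        self.value += 1
        ready = [a for t, a in self.pending if t <= self.value]
        self.pending = [(t, a) for t, a in self.pending if t > self.value]
        for action in ready:
            action()

class Nic:
    """One simulated NIC endpoint executing pre-posted triggered puts."""
    def __init__(self, rank, world):
        self.rank, self.world = rank, world
        self.buffer = None
        self.recv_ct = Counter()   # incremented on every message arrival

    def deliver(self, data):
        # Models a put landing in local memory and bumping the counter.
        self.buffer = data
        self.recv_ct.increment()

    def triggered_put(self, dest, threshold):
        # Pre-post: once recv_ct reaches threshold, forward buffer to dest.
        self.recv_ct.pending.append(
            (threshold, lambda: self.world[dest].deliver(self.buffer)))

def post_binomial_bcast(nics, root=0):
    """Pre-post a binomial-tree broadcast; the tree then unfolds NIC-to-NIC."""
    n = len(nics)
    for nic in nics:
        r = (nic.rank - root) % n           # rank relative to the root
        mask = 1
        while mask < n and not (r & mask):  # locate this rank's subtree
            mask <<= 1
        mask >>= 1
        while mask > 0:                     # forward after a single arrival
            if r + mask < n:
                nic.triggered_put((r + mask + root) % n, threshold=1)
            mask >>= 1

world = []
nics = [Nic(rank, world) for rank in range(8)]
world.extend(nics)
post_binomial_bcast(nics, root=0)
nics[0].deliver("payload")           # root's local write starts the tree
print([nic.buffer for nic in nics])  # every rank now holds "payload"
```

Because each non-leaf rank forwards after exactly one arrival (`threshold=1`), the entire dependency graph of the collective is encoded up front; on real hardware the analogous pre-posted triggered operations let the NIC progress the schedule asynchronously while the host computes.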
