Accelerating Intercommunication in Highly Parallel Systems

Every HPC system consists of numerous processing nodes interconnected through a number of different inter-process communication protocols, such as the Message Passing Interface (MPI) and Global Arrays (GA). Traditionally, research has focused on optimizing these protocols and identifying the most suitable ones for each system and/or application. Recently, there has been a proposal to unify the primitive operations of the different inter-processor communication protocols through the Portals library. Portals offers a set of low-level communication routines which can be composed to implement the functionality of different intercommunication protocols. However, Portals' modularity comes at a performance cost, since it adds one more layer to the actual protocol implementation. This work aims at closing the performance gap between a generic and reusable intercommunication layer, such as Portals, and the several monolithic and highly optimized intercommunication protocols. This is achieved through the development of a novel hardware offload engine that efficiently implements the basic Portals modules. Our innovative system is up to two orders of magnitude faster than the conventional software implementation of Portals, while the speedup achieved over the conventional monolithic software implementations of MPI and GA is more than an order of magnitude. The power consumption of our hardware system is less than 1/100th of what a low-power CPU consumes when executing the Portals software, while its silicon cost is less than 1/10th of that of a very simple RISC CPU. Moreover, our design process is also innovative: we first modeled the hardware within an untimed virtual prototype, which allowed for rapid design space exploration; we then applied a novel methodology to transform the untimed description into an efficient timed hardware description, which was in turn transformed into a hardware netlist through a High-Level Synthesis (HLS) tool.
