Streaming Message Interface: High-Performance Distributed Memory Programming on Reconfigurable Hardware

Distributed memory programming is the established paradigm used in high-performance computing (HPC) systems, requiring explicit communication between nodes and devices. When FPGAs are deployed in distributed settings, communication is typically handled either by going through the host machine, sacrificing performance, or by streaming across fixed device-to-device connections, sacrificing flexibility. We present Streaming Message Interface (SMI), a communication model and API that unifies explicit message passing with a hardware-oriented programming model, facilitating minimal-overhead, flexible, and productive inter-FPGA communication. Instead of bulk transmission, messages are streamed across the network during computation, allowing communication to be seamlessly integrated into pipelined designs. We present a high-level synthesis implementation of SMI targeting a dedicated FPGA interconnect, exposing runtime-configurable routing with support for arbitrary network topologies, and implement a set of distributed memory benchmarks. Using SMI, programmers can implement distributed, scalable HPC programs on reconfigurable hardware, without deviating from best practices for hardware design.
