Enabling shared memory communication in networks of MPSoCs

Ongoing transistor scaling and the growing complexity of embedded system designs have led to the rise of MPSoCs (Multi-Processor Systems-on-Chip), which combine multiple hard-core CPUs and accelerators (FPGA, GPU) on the same physical die. These devices are of great interest to the supercomputing community, which increasingly relies on heterogeneity to achieve power and performance goals in the closing stages of the race to exascale. In this paper, we present a network interface architecture and networking infrastructure designed to sit inside the FPGA fabric of a cutting-edge MPSoC device, enabling networks of these devices to communicate in both distributed and shared memory contexts, with reduced need for costly software networking system calls. We present our implementation and prototype system and discuss the main design decisions relevant to the use of the Xilinx Zynq UltraScale+, a state-of-the-art MPSoC, and the challenges to be overcome given the device's limitations and constraints. We demonstrate the working prototype system connecting two MPSoCs, with communication between a processor and a remote memory region and accelerator. We then discuss the limitations of the current implementation and highlight areas for improvement to make this solution production-ready.
