OpenMP Device Offloading to FPGAs Using the Nymble Infrastructure

Next to GPUs, FPGAs are an attractive target for OpenMP device offloading, as they allow the implementation of highly efficient, application-specific accelerators. However, prior approaches to OpenMP device offloading for FPGAs have been limited by the interfaces of the FPGA vendors' HLS tools or by their integration with the OpenMP runtime, e.g., for data mapping. This work presents an approach to OpenMP device offloading for FPGAs based on the LLVM compiler infrastructure and the Nymble HLS compiler. The automatic compilation flow uses LLVM IR both for HLS-specific optimizations and transformations and for the interaction with the Nymble HLS compiler. Parallel OpenMP constructs are automatically mapped to hardware threads that execute simultaneously in the generated FPGA accelerator, and the accelerator is integrated into libomptarget to support data mapping. In a case study, we demonstrate the use of the compilation flow and evaluate its performance.
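To make the programming model concrete, the following minimal sketch shows the kind of source code that OpenMP device offloading starts from. It uses only standard OpenMP directives; the function name `vec_add` and the vector-addition kernel are hypothetical illustrations and are not taken from the case study. The `map` clauses describe the data mapping that an offloading runtime such as libomptarget performs, and the `parallel for` construct is the kind of parallel region that a flow like the one described could map to hardware threads.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical kernel (not from the paper): element-wise vector addition.
 * If no offloading device is available, or the file is compiled without
 * OpenMP support, the loop simply runs on the host. */
void vec_add(const float *a, const float *b, float *c, int n)
{
    assert(a != NULL && b != NULL && c != NULL && n >= 0);

    /* map(to: ...)   copies the inputs to the device before execution;
     * map(from: ...) copies the result back to the host afterwards. */
    #pragma omp target map(to: a[0:n], b[0:n]) map(from: c[0:n])
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}
```

Calling `vec_add(a, b, c, n)` from host code looks like an ordinary function call; whether the region actually runs on an accelerator depends on the compiler and the offloading targets it was configured for.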
