Extending High-Level Synthesis with High-Performance Computing Performance Visualization

The recent maturity of High-Level Synthesis (HLS) has renewed interest in using Field-Programmable Gate Arrays (FPGAs) to accelerate High-Performance Computing (HPC) applications. Several studies have shown that, for a number of application kernels, FPGAs offer performance and power benefits over existing approaches, with ample room for further improvement. Unfortunately, modern HLS tools offer little support for understanding why a given application behaves as it does on the FPGA, and most experts rely on intuition or abstract performance models. In this work, we hypothesize that existing profiling and visualization tools used in the HPC domain are also usable for understanding performance on FPGAs. We extend an existing HLS tool-chain to support Paraver, a state-of-the-art visualization and profiling tool well known in HPC. We describe how each event and state is collected, and empirically quantify the associated hardware overhead. Finally, we apply our contribution to two different applications, demonstrating how the tool can provide unique insights into application execution and guide optimizations.
