Modeling and Implementing High Performance Programs on FPGA

This work investigates the potential of high performance computing (HPC) on field-programmable gate arrays (FPGAs), highlighting concepts and programming techniques for pursuing performance with high-level synthesis (HLS) tools. We compute the peak single-precision floating-point performance of the AlphaData 7V3 board using a model of replicated processing elements, then implement a benchmark to verify the predicted performance in hardware, using both the SDAccel framework and a custom reference design provided by Xilinx. The benchmarks reach 302 GOp/s and 548 GOp/s on the two platforms, respectively. The techniques are then applied to stencil computations: we propose a temporally pipelined streaming design for the 2D Jacobi stencil that scales with the available chip area by using on-chip memory to buffer the incoming wavefront, achieving a sustained performance of 256 GOp/s on a 256 × 256 grid. Finally, the current state of FPGAs is discussed based on the results obtained, and comments are made on the future of reconfigurable computing in HPC.
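
To make the two central ideas concrete, the following is a minimal software sketch, not the hardware design described in this work. It assumes a 4-point Jacobi stencil with fixed boundary cells and a simple peak model of the form (processing elements) × (operations per element per cycle) × (clock frequency). The function names `jacobi_step`, `jacobi_pipeline`, and `peak_gops`, and all numeric inputs in the usage note, are hypothetical and chosen only for illustration.

```python
def jacobi_step(grid):
    """Apply one 4-point Jacobi iteration; boundary cells are held fixed.

    Assumption: each interior cell becomes the average of its four
    nearest neighbours. The input grid is not modified.
    """
    rows, cols = len(grid), len(grid[0])
    out = [row[:] for row in grid]  # copy so boundaries carry over
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            out[i][j] = 0.25 * (grid[i - 1][j] + grid[i + 1][j]
                                + grid[i][j - 1] + grid[i][j + 1])
    return out


def jacobi_pipeline(grid, stages):
    """Chain `stages` Jacobi iterations back to back.

    This mimics, sequentially in software, what a temporally pipelined
    design does in hardware: the output of one timestep feeds the next.
    """
    for _ in range(stages):
        grid = jacobi_step(grid)
    return grid


def peak_gops(num_pes, ops_per_pe_per_cycle, clock_mhz):
    """Peak throughput in GOp/s for replicated processing elements.

    Assumption: every element issues its operations on every cycle.
    """
    return num_pes * ops_per_pe_per_cycle * clock_mhz / 1000.0
```

As a usage example with made-up inputs, `peak_gops(100, 2, 200)` evaluates the model for 100 processing elements performing 2 operations per cycle at 200 MHz, and `jacobi_pipeline(grid, 8)` applies eight consecutive timesteps to `grid`. In a streaming hardware design each stage would buffer only a few rows of the wavefront on chip, whereas this sketch keeps the whole grid in memory for clarity.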
