Parallel Programming for FPGAs

This book focuses on the use of algorithmic high-level synthesis (HLS) to build application-specific FPGA systems. Our goal is to give the reader an appreciation of the process of creating an optimized hardware design using HLS. Although the details necessarily differ from parallel programming for multicore processors or GPUs, many of the fundamental concepts are similar. For example, designers must understand memory hierarchy and bandwidth, spatial and temporal locality of reference, parallelism, and tradeoffs between computation and storage. This book is a practical guide for anyone interested in building FPGA systems. In a university environment, it is appropriate for advanced undergraduate and graduate courses. It is also useful for practicing system designers and embedded programmers. The book assumes the reader has a working knowledge of C/C++ and includes a significant amount of sample code. We also assume familiarity with basic computer architecture concepts (pipelining, speedup, Amdahl's Law, etc.). Knowledge of the RTL-based FPGA design flow is helpful but not required.
