High-level synthesis of functional patterns with Lift

High-level languages are commonly seen as a good fit for tackling the problem of performance portability across parallel architectures. The Lift framework is a recent approach that combines high-level, array-based programming abstractions with a system of rewrite rules that express algorithmic as well as low-level hardware optimizations. Lift has demonstrated its ability to address the challenge of performance portability across multiple types of CPU and GPU devices by automatically generating code that is on par with highly optimized hand-written code. This paper demonstrates the potential of Lift for targeting FPGA-based platforms. It presents the design of new Lift parallel patterns operating on data streams and describes the implementation of a Lift VHDL backend. This approach is evaluated on a Xilinx XC7Z010 FPGA using matrix multiplication, leading to a 10x speed-up over highly optimized CPU code and a commercial HLS tool. Furthermore, by considering the potential of design space exploration enabled by Lift, this work is a stepping stone towards automatically generating competitive code for FPGAs.
