Evaluating heuristics in automatically mapping multi-loop applications to FPGAs

This paper presents a set of measurements that characterize the design space for automatically mapping high-level algorithms, expressed in C as multiple loop nests, onto an FPGA. We extend a prior compiler algorithm that derived optimized FPGA implementations for individual loop nests. We focus on the space-time tradeoffs that arise when constrained chip area is shared among multiple computations organized as an asynchronous pipeline. Intermediate results are communicated on chip, and a communication analysis generates this on-chip communication automatically. Other analyses and transformations drawn from parallelizing compiler technology perform high-level optimization of the designs. We vary the amount of parallelism in individual loop nests with the goal of deriving an overall design that makes the most effective use of chip resources. We describe several heuristics for automatically searching this space, along with a set of metrics for evaluating and comparing designs. Using results obtained through an automated process, we demonstrate that heuristics derived through sophisticated compiler analysis are the most effective at navigating this complex search space, particularly for the more complex applications.
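To make the space-time tradeoff concrete, the following is a minimal illustrative sketch (not the paper's actual algorithm) of one possible greedy heuristic: each loop nest is a pipeline stage with an assumed unroll factor, and the search repeatedly unrolls the current bottleneck stage as long as the design still fits a fixed chip-area budget. The stage names, cost model, and area/cycle scaling below are hypothetical placeholders, not estimates from the paper.

```python
# Illustrative sketch only: greedy allocation of a fixed FPGA area budget
# among the loop nests (stages) of an asynchronous pipeline.
from dataclasses import dataclass

@dataclass
class Stage:
    name: str
    unroll: int = 1          # current unroll factor for this loop nest
    base_cycles: int = 0     # estimated cycles at unroll factor 1 (assumed model)
    base_area: int = 0       # estimated area (e.g., slices) at unroll factor 1

    def cycles(self) -> int:
        # Assumed model: execution cycles scale inversely with the unroll factor.
        return max(1, self.base_cycles // self.unroll)

    def area(self) -> int:
        # Assumed model: area scales linearly with the unroll factor.
        return self.base_area * self.unroll

def greedy_explore(stages: list[Stage], area_budget: int) -> list[Stage]:
    """Unroll the slowest pipeline stage further while the design fits on chip."""
    while True:
        bottleneck = max(stages, key=lambda s: s.cycles())
        # Unrolling one more step adds roughly one more copy of the base area.
        trial_area = sum(s.area() for s in stages) + bottleneck.base_area
        if trial_area > area_budget or bottleneck.cycles() <= 1:
            return stages          # no further move is both profitable and feasible
        bottleneck.unroll += 1     # increase parallelism of the bottleneck stage

if __name__ == "__main__":
    # Hypothetical two-stage pipeline and area budget, for illustration only.
    pipeline = [Stage("stage_a", base_cycles=4096, base_area=900),
                Stage("stage_b", base_cycles=1024, base_area=400)]
    for s in greedy_explore(pipeline, area_budget=6000):
        print(s.name, "unroll =", s.unroll, "cycles ~", s.cycles())
```

A heuristic of this shape stops when the bottleneck stage can no longer be accelerated within the remaining area; the paper's point is that richer compiler analyses can guide such choices far better than naive models like the one assumed here.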
