Coarse-grain pipelining on multiple FPGA architectures

Reconfigurable systems, and in particular FPGA-based custom computing machines, offer a unique opportunity to define application-specific architectures. These architectures offer performance advantages in application domains such as image processing, where customized pipelines exploit the inherent coarse-grain parallelism. In this paper we describe a set of program analyses and an implementation that map a sequential, unannotated C program onto a pipelined implementation running on a set of FPGAs, each with multiple external memories. Building on well-known parallel computing analysis techniques, our algorithms perform unrolling for operator parallelization, reuse and data layout analysis for memory parallelization, and precise communication analysis. We extend these techniques to FPGA-based systems to automatically partition the application's data and computation into custom pipeline stages, taking into account the available FPGA and interconnect resources. We illustrate the analysis components with an example machine vision program. We present the results of the algorithms, derived with minimal manual intervention, which demonstrate the potential of this approach for automatically deriving pipelined designs from high-level sequential specifications.
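To make the idea of coarse-grain pipelining concrete, the following C sketch contrasts a sequential two-loop program with a row-level pipelined schedule of the same computation. It is an illustration only, not the paper's algorithm or its machine vision example: the kernels (`gradient_row`, `threshold_row`), the image size, and the double-buffer scheme are all assumptions chosen for brevity. In software the two stages still run one after the other within each step; on a multi-FPGA system each stage could occupy its own device and the two calls in one step could overlap.

```c
#include <stdlib.h>

#define ROWS 8
#define COLS 8

/* Stage 1 (hypothetical kernel): absolute horizontal difference of one row. */
static void gradient_row(const int in[COLS], int out[COLS]) {
    out[0] = 0;
    for (int c = 1; c < COLS; c++)
        out[c] = abs(in[c] - in[c - 1]);
}

/* Stage 2 (hypothetical kernel): binarize one row against threshold t. */
static void threshold_row(const int in[COLS], int out[COLS], int t) {
    for (int c = 0; c < COLS; c++)
        out[c] = in[c] > t ? 1 : 0;
}

/* Sequential form: the second loop starts only after the first has
 * finished, so a full intermediate image must be stored. */
void edges_sequential(int img[ROWS][COLS], int out[ROWS][COLS], int t) {
    int tmp[ROWS][COLS];
    for (int r = 0; r < ROWS; r++)
        gradient_row(img[r], tmp[r]);
    for (int r = 0; r < ROWS; r++)
        threshold_row(tmp[r], out[r], t);
}

/* Pipelined schedule: at step s, stage 1 produces row s while stage 2
 * consumes row s - 1. Only two row buffers are live at any time. */
void edges_pipelined(int img[ROWS][COLS], int out[ROWS][COLS], int t) {
    int buf[2][COLS];
    for (int s = 0; s <= ROWS; s++) {
        if (s < ROWS)
            gradient_row(img[s], buf[s % 2]);
        if (s > 0)
            threshold_row(buf[(s - 1) % 2], out[s - 1], t);
    }
}
```

Both functions compute the same output. The pipelined version shows the two properties a compiler would have to establish: the communication between stages is one row per step, and the intermediate storage shrinks from a full image to two rows.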
