An FPGA-based computation model for blocked algorithms

Long-running applications typically involve a huge number of loop iterations, with or without loop-carried dependencies, over a data set that may be too large to load into memory as a whole. In this paper, an FPGA-based computation model is proposed to optimize this type of application, which must repeatedly load and store blocks of data inside a loop, by exploiting the flexibility of FPGA-based reconfigurable architectures. The kernel of the computation model is a set of computation cores, each of which consists of three pipeline stages and dual buffers to shorten memory access latency. Multiple cores can be configured to operate over a set of data in a buffer, similar to the traditional multiple-instruction, single-data (MISD) concept. Preliminary results for representative examples, such as pattern matching algorithms and QR matrix decomposition algorithms, are presented.
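
The interplay of a three-stage pipeline and dual buffers can be illustrated with a small software sketch. It is not the paper's design: the abstract does not name the three stages, so the sketch assumes they are load, compute, and store, and the names `run_pipeline`, `kernel`, and `block_size` are hypothetical. In each step the core loads block t into one buffer while computing on block t-1 from the other buffer and storing the result of block t-2.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Block = List[float]
# One trace entry per step: (loaded block, computed block, stored block).
Trace = List[Tuple[Optional[int], Optional[int], Optional[int]]]


def run_pipeline(
    data: Sequence[float],
    block_size: int,
    kernel: Callable[[Block], Block],
) -> Tuple[List[float], Trace]:
    """Simulate one core with assumed load/compute/store stages.

    The stages run one after another here; in hardware they would
    operate concurrently within the same step.
    """
    n_blocks = (len(data) + block_size - 1) // block_size
    buffers: List[Optional[Block]] = [None, None]  # ping-pong input buffers
    pending: Optional[Block] = None  # result waiting for the store stage
    output: List[float] = []
    trace: Trace = []

    # Two extra steps drain the pipeline after the last block is loaded.
    for step in range(n_blocks + 2):
        # Store stage: write back the result computed in the previous step.
        stored = None
        if pending is not None:
            output.extend(pending)
            stored = step - 2
            pending = None

        # Compute stage: work on the block loaded in the previous step.
        computed = None
        if 1 <= step <= n_blocks:
            pending = kernel(buffers[(step - 1) % 2])
            computed = step - 1

        # Load stage: fill the buffer the compute stage is not reading.
        loaded = None
        if step < n_blocks:
            start = step * block_size
            buffers[step % 2] = list(data[start:start + block_size])
            loaded = step

        trace.append((loaded, computed, stored))

    return output, trace


if __name__ == "__main__":
    result, steps = run_pipeline(list(range(10)), 4, lambda b: [x * x for x in b])
    print(result)
    for entry in steps:
        print(entry)
```

Because the load stage always writes to the buffer that the compute stage is not reading, a block transfer overlaps with computation on the previous block, which is the latency-hiding effect that dual buffers are meant to provide.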
