A reconfigurable macro-pipelined systolic accelerator architecture

In this paper, we propose a reconfigurable macro-pipelined systolic architecture (MAPS) that accelerates multiply-accumulate-based algorithms by exploiting temporal parallelism. To evaluate its performance, we implement a matrix multiplication accelerator with 32 processing elements (PEs) on the Xilinx ML605 evaluation board and achieve a peak performance of 51.2 GFLOPS (about 8.0 GFLOPS per PE per GHz). To demonstrate generality across algorithms, we also implement 2-dimensional convolution on MAPS. Moreover, the proposed architecture scales well: it can reach hundreds of GFLOPS using multiple FPGA devices.
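The two throughput figures in the abstract can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers quoted above (51.2 GFLOPS peak, 32 PEs, about 8.0 GFLOPS per PE per GHz); the function names are hypothetical, and the clock frequency it prints is an inference from those numbers, not a value stated in the abstract.

```python
# Cross-check of the abstract's throughput figures.
# Assumption: the normalized figure is simply peak / (PEs * clock in GHz).
# The resulting clock is inferred, not quoted from the paper.

def per_pe_gflops(peak_gflops: float, num_pes: int) -> float:
    """Peak throughput contributed by a single PE."""
    return peak_gflops / num_pes


def implied_clock_ghz(peak_gflops: float, num_pes: int,
                      gflops_per_pe_per_ghz: float) -> float:
    """Clock frequency that makes the normalized figure consistent."""
    return per_pe_gflops(peak_gflops, num_pes) / gflops_per_pe_per_ghz


PEAK_GFLOPS = 51.2   # peak performance quoted in the abstract
NUM_PES = 32         # number of PEs quoted in the abstract
NORMALIZED = 8.0     # "about 8.0 GFLOPS per PE per GHz"

per_pe = per_pe_gflops(PEAK_GFLOPS, NUM_PES)
clock = implied_clock_ghz(PEAK_GFLOPS, NUM_PES, NORMALIZED)

print(f"per-PE throughput: {per_pe:.2f} GFLOPS")
print(f"implied clock:     {clock * 1000:.0f} MHz")
```

Under that assumption each PE delivers about 1.6 GFLOPS, which matches the normalized figure at a clock of roughly 200 MHz. Because the abstract gives the normalized figure only approximately, the inferred clock is approximate as well.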
