Prototype implementation of array-processor extensible over multiple FPGAs for scalable stencil computation

This paper demonstrates and evaluates the performance and scalability of the systolic computational-memory array (SCMA) for stencil computation, a typical computing kernel in scientific simulation. We describe the basic architecture of the SCMA, and present the requirements and the design that allow SCMAs to operate scalably over multiple devices. We implement a prototype of the SCMA with three ALTERA Stratix III FPGAs, which form a 1--3 FPGA array by connecting three DE3 boards with different clock sources. The prototype SCMA demonstrates that the difference in operating clock frequency has little effect on the total number of execution cycles, although it introduces a small number of stall cycles in the sub-SCMAs on different FPGAs. With three benchmark programs of typical computing kernels based on the finite difference method, we show that adding FPGAs increases performance in proportion to the number of devices, resulting in almost linear speedup.
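For readers unfamiliar with the term, a stencil computation updates every grid point from a fixed pattern of neighboring points, as finite difference schemes do. The following is a minimal software sketch in Python, not the SCMA hardware design: the 5-point averaging stencil, the fixed-boundary treatment, and the names `jacobi_step` and `run_stencil` are illustrative assumptions and do not come from the paper.

```python
def jacobi_step(grid):
    """Apply one sweep of a 5-point stencil to a 2D grid.

    Each interior cell becomes the average of its four neighbors.
    Boundary cells are left unchanged (an assumed fixed boundary condition).
    """
    rows, cols = len(grid), len(grid[0])
    new = [row[:] for row in grid]  # copy so every update reads the old values
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            new[i][j] = 0.25 * (
                grid[i - 1][j] + grid[i + 1][j] + grid[i][j - 1] + grid[i][j + 1]
            )
    return new


def run_stencil(grid, steps):
    """Repeat the stencil sweep for the given number of time steps."""
    for _ in range(steps):
        grid = jacobi_step(grid)
    return grid


if __name__ == "__main__":
    # Hypothetical 4x4 grid with a hot top edge and cold remaining edges.
    initial = [
        [1.0, 1.0, 1.0, 1.0],
        [0.0, 0.0, 0.0, 0.0],
        [0.0, 0.0, 0.0, 0.0],
        [0.0, 0.0, 0.0, 0.0],
    ]
    for row in run_stencil(initial, 10):
        print(row)
```

Because every cell reads only its nearest neighbors, the grid can in principle be split into sub-grids that exchange only their edge values, which is the property that makes this kind of kernel a natural fit for an array of devices.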
