Accelerating 3D-FFT Using Hard Embedded Blocks in FPGAs

The three-dimensional Fast Fourier Transform (3D-FFT) is widely used in scientific applications in domains such as image processing, bioinformatics and molecular dynamics. The 3D-FFT computation typically accounts for a significant part of the execution time of these applications, so speeding them up requires accelerating the 3D-FFT. One option is a Field Programmable Gate Array (FPGA) based accelerator. However, a speedup is not always possible: FPGAs run at lower clock frequencies than processors, and the resources available in an FPGA device might not be sufficient to implement enough copies of the processing elements to compensate for the lower clock frequency. FPGAs that combine a heterogeneous mix of coarse-grained hard blocks with programmable soft logic can accommodate a much larger number of processing elements and thus achieve much higher speedups. Modern FPGAs already contain heterogeneous hard embedded blocks (HEBs) such as multipliers, DSP blocks and memory units, and it is reasonable to expect that many more such hard blocks will be embedded into future FPGAs. Evaluating which HEBs to identify and incorporate is complex, because many parameters and constraints, such as area, granularity and routing resources, need to be considered in an integrated manner to obtain an efficient implementation. In this paper we show acceleration of 3D-FFT using future fabrics incorporating HEBs. Using these fabrics, we show speedups of up to 1900x for a 2048-point FFT. We also present an evaluation methodology for designing future FPGA fabrics that incorporate accelerators as hard embedded blocks. This methodology will be useful for (i) selecting the blocks to be embedded into the fabric and (ii) evaluating the performance gain that can be achieved by such an embedding.
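As background for why replicating processing elements helps, the following is a minimal sketch of the standard row-column decomposition of a 3D-FFT into independent 1D FFTs along each axis. It is an illustration in plain Python and not the hardware design evaluated in this paper; the function names `fft1d` and `fft3d` and the nested-list data layout are assumptions made for the example.

```python
import cmath

def fft1d(x):
    """Recursive radix-2 FFT of a sequence whose length is a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft1d(x[0::2])
    odd = fft1d(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        # Twiddle factor applied to the odd-indexed half.
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def fft3d(a):
    """3D FFT of a nested list a[x][y][z] using 1D FFTs along z, then y, then x.

    Each dimension must be a power of two (a limitation of fft1d above).
    """
    nx, ny, nz = len(a), len(a[0]), len(a[0][0])
    # Pass 1: transform every line along the innermost (z) axis.
    b = [[fft1d(a[i][j]) for j in range(ny)] for i in range(nx)]
    # Pass 2: transform every line along the y axis.
    for i in range(nx):
        for k in range(nz):
            line = fft1d([b[i][j][k] for j in range(ny)])
            for j in range(ny):
                b[i][j][k] = line[j]
    # Pass 3: transform every line along the x axis.
    for j in range(ny):
        for k in range(nz):
            line = fft1d([b[i][j][k] for i in range(nx)])
            for i in range(nx):
                b[i][j][k] = line[i]
    return b
```

Within each pass, the 1D transforms operate on disjoint lines of data and do not depend on one another, which is why a larger number of 1D FFT processing elements can, in principle, work in parallel.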
