Energy optimizations for FPGA-based 2-D FFT architecture

The row-column algorithm is commonly used to implement the 2-D FFT on FPGAs. However, in this algorithm, strided access to external memory such as DRAM incurs significant delay for DRAM row activation, resulting in high DRAM energy and a significant amount of FPGA device energy consumed in the idle state. In this paper, to optimize the energy consumption of the 2-D FFT architecture, we employ an FPGA-based 1-D FFT kernel that processes streaming data to fully utilize the bandwidth offered by the external memory, and we balance the I/O bandwidth between the DRAM and the FPGA to minimize FPGA idle time. Furthermore, to avoid time-consuming DRAM row activations, we decompose the transposition required by the column-wise FFTs into smaller subproblems, enabling on-chip local transposition that can be performed by the customized data permutation unit used in the 1-D FFT kernel. Compared with the baseline 2-D FFT architecture, the optimized architecture achieves 3.9×, 4.2× and 4.5× improvement in energy efficiency for 1024 × 1024, 4096 × 4096 and 8192 × 8192 point 2-D FFTs, respectively. We also estimate the peak energy efficiency of the FPGA-based 2-D FFT architecture. Our estimate shows that the optimized 2-D FFT kernel can achieve 8.06 to 8.31 GFLOPS/W for various 2-D FFT sizes, i.e., up to 62% of the peak energy efficiency of a 2-D FFT architecture on FPGA.
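
To make the row-column decomposition and the tiled transposition concrete, the following is a minimal software sketch in Python. It is not the hardware architecture described above: the function names (`fft1d`, `transpose_tiled`, `fft2d_row_column`) and the `tile` parameter are hypothetical, and the tile merely stands in for a block small enough to be permuted locally on chip. The sketch assumes a square input whose side is a power of two.

```python
import cmath


def fft1d(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft1d(x[0::2])
    odd = fft1d(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def transpose_tiled(a, tile):
    """Transpose a square matrix one tile at a time.

    Each tile-by-tile block is read and written as a unit, which models
    (in software only) replacing one large strided transposition with
    many small local ones. The tile size is an illustrative assumption.
    """
    n = len(a)
    out = [[0j] * n for _ in range(n)]
    for bi in range(0, n, tile):
        for bj in range(0, n, tile):
            for i in range(bi, min(bi + tile, n)):
                for j in range(bj, min(bj + tile, n)):
                    out[j][i] = a[i][j]
    return out


def fft2d_row_column(a, tile=2):
    """Row-column 2-D FFT: row FFTs, transpose, row FFTs again, transpose back."""
    rows = [fft1d(r) for r in a]          # row-wise 1-D FFTs
    cols = transpose_tiled(rows, tile)    # turn columns into rows
    cols = [fft1d(c) for c in cols]       # column-wise 1-D FFTs
    return transpose_tiled(cols, tile)    # restore the original orientation
```

Because the column-wise pass operates on transposed data, every 1-D FFT in this sketch reads contiguous elements; the only non-contiguous step is the transposition, which is confined to small tiles.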
