A Variable-Size FFT Hardware Accelerator Based on Matrix Transposition

The fast Fourier transform (FFT) is the core kernel and the most time-consuming algorithm in digital signal processing, and the FFT sizes required by different applications vary widely. This paper therefore proposes a variable-size FFT hardware accelerator that fully supports the IEEE-754 single-precision floating-point standard and FFT sizes ranging from 2 to $2^{20}$ points. First, a parallel Cooley–Tukey FFT algorithm based on matrix transposition (MT) is proposed, which efficiently divides a large FFT into several small FFTs that can be executed in parallel. Second, guided by this algorithm, the FFT hardware accelerator is designed, and several performance optimization techniques are proposed: hybrid twiddle factor generation, multibank data memory, block MT, and token-based task scheduling. Third, the VLSI implementation is detailed, showing that the accelerator runs at 1 GHz with an area of 2.4 mm² and a power consumption of 91.3 mW at 25 °C and 0.9 V. Finally, several experiments evaluate the proposal's performance in terms of FFT execution time, resource utilization, and power consumption. Comparative experiments show that the accelerator achieves speedups of up to $18.89\times$ over two software-only solutions and two hardware-dedicated solutions.
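
The abstract does not spell out the decomposition, so the following is only a minimal sketch of the general idea, assuming the classic "six-step" Cooley–Tukey formulation: an $N = N_1 N_2$ point FFT is computed as $N_2$ independent $N_1$-point FFTs, a twiddle multiplication, and $N_1$ independent $N_2$-point FFTs, with transpositions in between. The function names (`fft_radix2`, `transpose`, `fft_via_transposition`) and the pure-software structure are illustrative assumptions, not the paper's hardware algorithm.

```python
import cmath


def dft(x):
    """Naive O(n^2) DFT, used here only as a correctness reference."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]


def fft_radix2(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft_radix2(x[0::2])
    odd = fft_radix2(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def transpose(m):
    """Transpose a matrix stored as a list of rows."""
    return [list(row) for row in zip(*m)]


def fft_via_transposition(x, n1, n2):
    """Six-step FFT of length n1*n2 (n1 and n2 powers of two, an assumption
    of this sketch because the small FFTs use fft_radix2)."""
    n = n1 * n2
    assert len(x) == n
    # Step 1: view x as an n1-by-n2 row-major matrix and transpose it,
    # so that row i2 holds the samples x[n2*i1 + i2] for i1 = 0..n1-1.
    rows = [x[r * n2:(r + 1) * n2] for r in range(n1)]
    a = transpose(rows)
    # Step 2: n2 independent n1-point FFTs (these could run in parallel).
    b = [fft_radix2(row) for row in a]
    # Step 3: multiply element (i2, k1) by the twiddle factor W_n^(i2*k1).
    for i2 in range(n2):
        for k1 in range(n1):
            b[i2][k1] *= cmath.exp(-2j * cmath.pi * i2 * k1 / n)
    # Step 4: transpose so each row holds a fixed k1 and all i2.
    c = transpose(b)
    # Step 5: n1 independent n2-point FFTs (these could also run in parallel).
    d = [fft_radix2(row) for row in c]
    # Step 6: final transpose; output index is k1 + n1*k2.
    e = transpose(d)
    return [v for row in e for v in row]


if __name__ == "__main__":
    x = [complex(i % 7, -(i % 5)) for i in range(64)]
    err = max(abs(p - q) for p, q in zip(fft_via_transposition(x, 8, 8), dft(x)))
    print("max abs error vs. naive DFT:", err)
```

The row FFTs in steps 2 and 5 have no data dependencies on one another, which is what makes this kind of decomposition attractive for parallel execution; the transpositions are the price paid for keeping every small FFT on contiguous data.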
