Paper Information - Efficient Large-Scale 1D FFT Vectorization on Multi-Core Vector Accelerator


Matrix2 is a high-performance multi-core vector accelerator for high-density computing that supports fused multiply-add (FMA) instructions. We propose an efficient large-scale 1D FFT vectorization method tailored to the architectural characteristics of Matrix2. (1) An FFT vectorization method based on FMA instructions is proposed to accelerate the butterfly computation. By restructuring the operation flow of the butterfly, the separate multiplications and additions of the conventional FFT are combined into a smaller number of FMA operations: the real floating-point operation count falls from 10 multiplications/additions to 6 FMA operations for the radix-2 butterfly, and from 34 multiplications/additions to 24 FMA operations for the radix-4 butterfly. (2) An FFT vectorization method based on the matrix Fourier algorithm is designed, which converts the 1D FFT into a 2D FFT in three steps: column FFTs, multiplication of the column FFT results by a factor matrix, and row FFTs. All three steps are vectorized. (3) A data layout and update method for the factor matrix is proposed, which greatly reduces the memory required to store the factor matrix. By combining the column FFT with the factor matrix multiplication, it also avoids repeated data transfers between the vector array memory and the global cache, which significantly improves the computational efficiency of the FFT. (4) A double-buffered DMA mechanism is adopted to smooth data transfers across the multi-level storage hierarchy and to overlap data transfer time with computation time, reducing the total run time.
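The operation count quoted in (1) for the radix-2 butterfly can be illustrated with a classic FMA formulation in the style of the multiply-add kernels from the literature. This is a minimal sketch and not necessarily the exact scheme used on Matrix2. The helper `fma`, the function names, and the use of the ratio `t = wi / wr` are assumptions made for illustration. The sketch also assumes the real part of the twiddle factor is nonzero; twiddles such as ±i would need a separate code path.

```python
import cmath


def fma(a, b, c):
    """Stand-in for a hardware fused multiply-add (assumption): returns a * b + c."""
    return a * b + c


def butterfly_radix2_reference(a, b, w):
    """Conventional butterfly: 4 mul + 2 add for w * b, then 4 add/sub = 10 real ops."""
    p = w * b
    return a + p, a - p


def butterfly_radix2_fma(a, b, w):
    """Same butterfly written as 6 fused multiply-adds.

    Uses w * b = wr * ((br - t * bi) + i * (bi + t * br)) with t = wi / wr.
    In a real kernel t would be precomputed alongside the twiddle table.
    Assumes w.real != 0.
    """
    wr, wi = w.real, w.imag
    t = wi / wr
    sr = fma(-t, b.imag, b.real)   # FMA 1: br - t * bi
    si = fma(t, b.real, b.imag)    # FMA 2: bi + t * br
    xr = fma(wr, sr, a.real)       # FMA 3: ar + wr * sr
    xi = fma(wr, si, a.imag)       # FMA 4: ai + wr * si
    yr = fma(-wr, sr, a.real)      # FMA 5: ar - wr * sr
    yi = fma(-wr, si, a.imag)      # FMA 6: ai - wr * si
    return complex(xr, xi), complex(yr, yi)
```

Both functions take two complex inputs `a`, `b` and a twiddle factor `w`, and return the pair `(a + w*b, a - w*b)`, so the FMA variant can be checked against the reference on any inputs with `w.real != 0`.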
Experimental results on Matrix2 show that the proposed vectorization method improves the computational efficiency of large-scale 1D FFT by a factor of 5.56 on average.
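The three-step decomposition described in (2) can be sketched in a few lines of plain Python. This is an illustrative scalar version under stated assumptions, not the vectorized Matrix2 implementation: the input is viewed as a `rows` x `cols` matrix in row-major order, a naive O(n^2) DFT stands in for the FFT kernels, and the function names `dft` and `matrix_fourier_dft` are hypothetical.

```python
import cmath


def dft(seq):
    """Naive O(n^2) DFT, used only as a small stand-in for a real FFT kernel."""
    n = len(seq)
    return [
        sum(seq[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def matrix_fourier_dft(x, rows, cols):
    """Length rows*cols transform via column transforms, factor matrix, row transforms.

    The input is viewed as a matrix a[n1][n2] = x[cols * n1 + n2].
    """
    n = rows * cols
    if len(x) != n:
        raise ValueError("len(x) must equal rows * cols")

    # Step 1: length-`rows` transform down each column; col_out[n2][k1].
    col_out = [dft([x[cols * n1 + n2] for n1 in range(rows)]) for n2 in range(cols)]

    # Step 2: multiply element-wise by the factor matrix exp(-2*pi*i*k1*n2/n);
    # the result is stored as scaled[k1][n2].
    scaled = [
        [col_out[n2][k1] * cmath.exp(-2j * cmath.pi * k1 * n2 / n) for n2 in range(cols)]
        for k1 in range(rows)
    ]

    # Step 3: length-`cols` transform along each row; row_out[k1][k2].
    row_out = [dft(row) for row in scaled]

    # Output index mapping: X[k1 + rows * k2] = row_out[k1][k2].
    result = [0j] * n
    for k1 in range(rows):
        for k2 in range(cols):
            result[k1 + rows * k2] = row_out[k1][k2]
    return result
```

For example, `matrix_fourier_dft(x, 3, 4)` on a length-12 sequence should agree with `dft(x)` up to rounding error. In this sketch steps 1 and 2 are kept separate for readability, whereas (3) describes combining them to avoid extra data transfers.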

Xiaowen Chen | Zhong Liu | Yuanwu Lei | Man Liao | Xi Tian
