Efficient Large-Scale 1D FFT Vectorization on Multi-Core Vector Accelerator

The Matrix2 Accelerator is a high-performance multi-core vector processor for high-density computing that supports fused multiply-add (FMA) instructions. We propose an efficient vectorization method for large-scale 1D FFT, tailored to the architectural characteristics of Matrix2. (1) An FFT vectorization method based on FMA instructions is proposed to accelerate FFT computation. By restructuring the operation flow of the FFT butterfly, the separate multiplications and additions of the traditional method are merged into a smaller number of FMA operations. This reduces the real floating-point operation count of a radix-2 butterfly from 10 multiplications/additions to 6 FMA operations, and that of a radix-4 butterfly from 34 multiplications/additions to 24 FMA operations. (2) An FFT vectorization method based on the matrix Fourier algorithm is designed, which converts the 1D FFT into a 2D FFT computed in three steps: column FFTs, multiplication of the column FFT results by a factor matrix, and row FFTs. All three steps are vectorized. (3) A data layout and updating method for the factor matrix is proposed, which greatly reduces the memory capacity required to hold the factor matrix. In addition, combining the column FFT computation with the factor matrix multiplication avoids repeated data transfers between the vector array memory and the global cache, which significantly improves the computational efficiency of the FFT. (4) A double-buffering DMA mechanism is adopted to smooth data transfers across the multi-level storage hierarchy, overlapping data transfer time with computation time to reduce the total execution time.
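The radix-2 operation counts in contribution (1) can be illustrated with a small sketch. The abstract does not spell out the exact transformation used on Matrix2, so the rearrangement below is only one well-known way to reach 6 fused multiply-add operations, not necessarily the authors' formulation. The helper `fma` is a hypothetical stand-in written in plain Python (it rounds twice, whereas a hardware unit rounds once), and the sketch assumes that negating an operand is free, as it typically is on hardware with negated multiply-add variants.

```python
def fma(a, b, c):
    """Hypothetical stand-in for a hardware fused multiply-add: a*b + c."""
    return a * b + c


def butterfly_classic(a, b, w):
    """Radix-2 butterfly with separate operations: 4 mul + 6 add/sub = 10."""
    pr = w.real * b.real - w.imag * b.imag      # 2 mul, 1 sub
    pi = w.real * b.imag + w.imag * b.real      # 2 mul, 1 add
    return (complex(a.real + pr, a.imag + pi),  # 2 add
            complex(a.real - pr, a.imag - pi))  # 2 sub


def butterfly_fma(a, b, w):
    """Same butterfly written as 6 multiply-add operations."""
    t_r = fma(-w.imag, b.imag, a.real)  # a_r - w_i*b_i
    x_r = fma(w.real, b.real, t_r)      # a_r + Re(w*b)
    t_i = fma(w.imag, b.real, a.imag)   # a_i + w_i*b_r
    x_i = fma(w.real, b.imag, t_i)      # a_i + Im(w*b)
    y_r = fma(2.0, a.real, -x_r)        # 2*a_r - x_r = a_r - Re(w*b)
    y_i = fma(2.0, a.imag, -x_i)        # 2*a_i - x_i = a_i - Im(w*b)
    return complex(x_r, x_i), complex(y_r, y_i)
```

Both functions return the pair (a + w*b, a - w*b). The second derives the difference output from the sum output, which removes the four standalone additions of the classic form.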
Experimental results on Matrix2 show that the proposed vectorization method improves the computational efficiency of large-scale 1D FFT by a factor of 5.56 on average.
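The three-step structure of the matrix Fourier algorithm described in contribution (2) can be sketched in scalar Python as follows. This is a minimal illustration under stated assumptions, not the vectorized Matrix2 implementation: the function names `dft` and `matrix_fourier_fft` are hypothetical, a naive DFT stands in for the column and row FFT kernels, the factor matrix is recomputed on the fly, and the factor matrix layout and DMA double buffering are not modeled.

```python
import cmath


def dft(v):
    """Naive O(n^2) DFT; stands in for the small column/row FFT kernels."""
    n = len(v)
    return [sum(v[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]


def matrix_fourier_fft(x, rows, cols):
    """Length rows*cols DFT computed as column DFTs, scaling, row DFTs."""
    n = rows * cols
    assert len(x) == n
    # View x as a rows x cols matrix in row-major order: a[r][c] = x[r*cols + c].
    a = [list(x[r * cols:(r + 1) * cols]) for r in range(rows)]

    # Step 1: a length-`rows` transform down every column.
    for c in range(cols):
        col = dft([a[r][c] for r in range(rows)])
        for k1 in range(rows):
            a[k1][c] = col[k1]

    # Step 2: element-wise multiplication by the factor matrix,
    # whose entry (k1, c) is exp(-2*pi*i * k1 * c / n).
    for k1 in range(rows):
        for c in range(cols):
            a[k1][c] *= cmath.exp(-2j * cmath.pi * k1 * c / n)

    # Step 3: a length-`cols` transform along every row.
    for k1 in range(rows):
        a[k1] = dft(a[k1])

    # Output index k = k1 + rows*k2 maps to a[k1][k2] (transposed read-out).
    return [a[k % rows][k // rows] for k in range(n)]
```

For example, `matrix_fourier_fft(x, 3, 4)` on a 12-point input agrees with `dft(x)` up to rounding error. Steps 1 and 2 touch the same matrix entries, which is why fusing them, as contribution (3) describes, can save a pass over the data.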