A memory bandwidth improvement with memory space partitioning for single-precision floating-point FFT on Stratix 10 FPGA
The Fast Fourier Transform (FFT) is one of the fundamental computational methods in computational science and high-performance computing. The single-precision floating-point complex FFT is known to be limited by memory bandwidth, and it often becomes the bottleneck when accelerating applications in these fields. We are researching and developing a parallel FFT on FPGA(s) to overcome this problem. In this paper, we discuss the memory bandwidth of the single-precision floating-point complex FFT on an FPGA. Our FFT implementation is based on a state-of-the-art OpenCL implementation provided by Intel. We first show that the computational performance of the FFT on the Intel PAC D5005 is proportional to the effective bandwidth of the main memory. We then propose a memory sub-system that improves the effective memory bandwidth. Specifically, it consists of memory space partitioning and sub-modules that each access one memory space individually. In our FPGA design running at 270 MHz, two DDR4-2400 memory channels each are used for reading and for writing. With the proposed memory sub-system, we achieved an effective memory bandwidth of 22.57 GB/s (65.3% of the theoretical peak of this implementation) when the number of FFT data points was 16,777,216.
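The relationship between the reported figures can be checked with a short calculation. The sketch below is not taken from the paper: it assumes that each memory channel is accessed through a 512-bit (64-byte) interface clocked at the 270 MHz design frequency, so that the "theoretical peak of this implementation" is clock rate times interface width times number of channels. The function name and the interface width are illustrative assumptions; under them, the stated 65.3% is reproduced.

```python
# Sanity check of the abstract's bandwidth figures.
# ASSUMPTION (not stated in the source): each memory channel is accessed
# through a 64-byte (512-bit) wide interface, one transfer per kernel clock.

def peak_bandwidth_gbs(clock_mhz: float, bytes_per_cycle: int, channels: int) -> float:
    """Peak bandwidth in GB/s (10^9 bytes/s) for a clocked memory interface."""
    return clock_mhz * 1e6 * bytes_per_cycle * channels / 1e9

CLOCK_MHZ = 270        # design frequency given in the abstract
BYTES_PER_CYCLE = 64   # assumed interface width per channel (hypothetical)
CHANNELS = 2           # channels per direction given in the abstract

peak = peak_bandwidth_gbs(CLOCK_MHZ, BYTES_PER_CYCLE, CHANNELS)
measured = 22.57       # effective bandwidth reported in the abstract, GB/s
efficiency = measured / peak

# Size of the data set: 16,777,216 = 2^24 complex points,
# each made of two 4-byte single-precision values.
points = 2 ** 24
dataset_mib = points * 8 / 2 ** 20

print(f"peak = {peak:.2f} GB/s, efficiency = {efficiency:.1%}")
print(f"data set = {dataset_mib:.0f} MiB")
```

For comparison, the nominal rate of one DDR4-2400 channel with a 64-bit data bus is 19.2 GB/s, so the 270 MHz kernel clock, rather than the DRAM itself, would set the peak under these assumptions.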