Auto-tuning of the FFTW Library for Massively Parallel Supercomputers
This paper presents work carried out by CINECA within the PRACE-2IP project to improve the performance of the FFTW library by refining its built-in auto-tuning mechanism. The work comprised four activities: (1) identifying the major bottlenecks in the current FFTW implementation; (2) investigating FFTW's auto-tuning mechanism to understand how domain decomposition affects performance; (3) introducing a new parallel domain decomposition; and (4) building a library to improve the performance of the auto-tuning mechanism. In particular, we compared the standard slab decomposition algorithm already present in FFTW with the new 2D domain decomposition and found that, on massively parallel supercomputers, the new algorithm performs significantly better.
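To make the difference between the two decompositions concrete, the following minimal Python sketch computes how an n x n x n grid would be partitioned under each scheme. It is an illustration only: the function names (`slab_local_shape`, `pencil_local_shape`, `idle_ranks`) and the simple block-distribution rule are assumptions made for this example and are not part of the FFTW API or of the library described in the paper. The sketch assumes a slab decomposition splits only the first axis, while a 2D decomposition (often called a pencil decomposition) splits the first two axes over a process grid.

```python
def slab_local_shape(n, nprocs, rank):
    """Local block of an n x n x n grid split along the first axis only."""
    base, extra = divmod(n, nprocs)
    # The first `extra` ranks receive one additional plane.
    nx = base + (1 if rank < extra else 0)
    return (nx, n, n)


def pencil_local_shape(n, prow, pcol, rank):
    """Local block when the first two axes are split over a prow x pcol grid."""
    r, c = divmod(rank, pcol)  # row-major position in the process grid
    bx, ex = divmod(n, prow)
    by, ey = divmod(n, pcol)
    nx = bx + (1 if r < ex else 0)
    ny = by + (1 if c < ey else 0)
    return (nx, ny, n)


def idle_ranks(shapes):
    """Count ranks whose local block is empty (any dimension equal to zero)."""
    return sum(1 for s in shapes if 0 in s)


if __name__ == "__main__":
    n, nprocs = 8, 16  # more processes than grid planes
    slabs = [slab_local_shape(n, nprocs, r) for r in range(nprocs)]
    pencils = [pencil_local_shape(n, 4, 4, r) for r in range(nprocs)]
    print("idle ranks, slab:  ", idle_ranks(slabs))
    print("idle ranks, pencil:", idle_ranks(pencils))
```

Under these assumptions, a slab decomposition can keep at most n processes busy, whereas a 2D decomposition can use up to n * n. This is one plausible reason for the 2D scheme to scale better on massively parallel machines, although the cost of its additional transpose communication is not modelled here.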