Bandwidth intensive 3-D FFT kernel for GPUs using CUDA
Satoshi Matsuoka | Toshio Endo | Akira Nukada | Yasuhiko Ogata
[1] J. Tukey, et al. An algorithm for the machine calculation of complex Fourier series, 1965.
[2] J. J. Lambiotte, et al. Computing the Fast Fourier Transform on a vector computer, 1979.
[3] Paul N. Swarztrauber, et al. FFT algorithms for vector computers, 1984, Parallel Comput.
[4] C. Loan. Computational Frameworks for the Fast Fourier Transform, 1992.
[5] S. Goedecker. Rotating a three-dimensional array in an optimal position for vector processing: case study for a three-dimensional fast Fourier transform, 1993.
[6] Ramesh C. Agarwal, et al. An efficient parallel algorithm for the 3-D FFT NAS parallel benchmark, 1994, Proceedings of IEEE Scalable High Performance Computing Conference.
[7] Ramesh C. Agarwal, et al. A high performance parallel algorithm for 1-D FFT, 1994, Proceedings of Supercomputing '94.
[8] Markus Hegland. Real and Complex Fast Fourier Transforms on the Fujitsu VPP 500, 1996, Parallel Comput.
[9] David K. McAllister, et al. Fast Matrix Multiplies Using Graphics Hardware, 2001, ACM/IEEE SC 2001 Conference (SC'01).
[10] Mitsuo Yokokawa, et al. 16.4-Tflops Direct Numerical Simulation of Turbulence by a Fourier Spectral Method on the Earth Simulator, 2002, ACM/IEEE SC 2002 Conference (SC'02).
[11] Kenneth Moreland, et al. The FFT on a GPU, 2003, HWWS '03.
[12] Daisuke Takahashi. Efficient implementation of parallel three-dimensional FFT on clusters of PCs, 2003.
[13] William R. Mark, et al. Cg: a system for programming graphics hardware in a C-like language, 2003, ACM Trans. Graph.
[14] Z. Weng, et al. ZDOCK: An initial‐stage protein‐docking algorithm, 2003, Proteins.
[15] Pat Hanrahan, et al. Brook for GPUs: stream computing on graphics hardware, 2004, SIGGRAPH 2004.
[16] Steven G. Johnson, et al. The Design and Implementation of FFTW3, 2005, Proceedings of the IEEE.
[17] N.K. Govindaraju, et al. A Memory Model for Scientific Algorithms on Graphics Processors, 2006, ACM/IEEE SC 2006 Conference (SC'06).
[18] David Tarditi, et al. Accelerator: using data parallelism to program GPUs for general-purpose uses, 2006, ASPLOS XII.
[19] Dawid Pajak. General-Purpose Computation Using Graphics Hardware for Fast HDR Image Processing, 2007.
[20] Mark J. Stock, et al. Toward efficient GPU-accelerated N-body simulations, 2008.
[21] Satoshi Matsuoka, et al. An efficient, model-based CPU-GPU heterogeneous FFT library, 2008, 2008 IEEE International Symposium on Parallel and Distributed Processing.
[22] Emilio L. Zapata, et al. Memory Locality Exploitation Strategies for FFT on the CUDA Architecture, 2008, VECPAR.
[23] Wen-mei W. Hwu, et al. Compute Unified Device Architecture Application Suitability, 2009, Computing in Science & Engineering.