SPLG: A Tuned Signal Processing Library for GPU Architectures

To increase the efficiency of existing software, many works are incorporating GPU processing. However, despite current advances in GPU languages and tools, taking advantage of the parallel architecture of GPUs is still far more complex than programming standard multi-core CPUs. Performance profiling and analysis of known applications provide useful insight into the hardware architecture and memory hierarchy. This analysis can then be used to identify potential bottlenecks and to tune other software so that it makes more efficient use of the available resources. In this work we implement a small signal processing library, which is used to characterize the performance of the most recent NVIDIA GPU architectures. The methodology behind the library is based on a series of building blocks that enable us to design several well-known algorithms with little effort. The library was built paying special attention to flexibility and adaptability. We also show how a generic approach can be used to design these GPU algorithms while obtaining competitive performance, which is especially interesting from the productivity standpoint.
