Image Processing on Multicore x86 Architectures

As multicore architectures displace single-core designs in current and future computing systems, traditional applications built on sequential algorithms can no longer rely on technology scaling to improve performance. Instead, applications must switch to parallel algorithms to take advantage of multicore performance. Image processing applications exhibit a high degree of parallelism and are excellent candidates for multicore systems. However, simply exploiting parallelism is not enough to achieve the best performance. Optimization must also take into account characteristics of the underlying architecture, such as wide vector (SIMD) units and limited memory bandwidth. This article illustrates techniques for optimizing performance on multicore x86 systems using three key image processing kernels: the fast Fourier transform, convolution, and histogram.
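As a rough illustration of why parallelism alone is not enough, consider the histogram kernel. If all threads increment one shared array, they contend for the same bins. One common remedy, shown here only as a minimal sketch and not necessarily the technique the article uses, is to give each thread a private histogram and merge the copies at the end. The function name `histogram_u8`, the 256-bin layout for 8-bit pixels, and the use of OpenMP are all assumptions made for this example; without OpenMP enabled, the pragmas are ignored and the code runs sequentially with the same result.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NUM_BINS 256  /* assumption: 8-bit grayscale pixels */

/* Hypothetical helper: counts how often each 8-bit value occurs.
   Each thread accumulates into a private copy so that the hot loop
   performs no synchronization; the copies are merged once per thread. */
void histogram_u8(const uint8_t *pixels, size_t n, uint32_t bins[NUM_BINS])
{
    assert(pixels != NULL || n == 0);
    memset(bins, 0, NUM_BINS * sizeof bins[0]);

    #pragma omp parallel
    {
        uint32_t local[NUM_BINS] = {0};  /* thread-private histogram */

        /* Split the pixel range across threads. */
        #pragma omp for nowait
        for (long i = 0; i < (long)n; ++i)
            local[pixels[i]]++;

        /* Merge: one short critical section per thread. */
        #pragma omp critical
        for (int b = 0; b < NUM_BINS; ++b)
            bins[b] += local[b];
    }
}
```

The private copies cost 1 KB per thread, which fits comfortably in a first-level cache. For much larger bin counts this trade-off would need to be revisited.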
