A Novel Minimum Time Parallel 2-D Discrete Wavelet Transform Algorithm for General Purpose Processors

A novel, efficient, in-place, multithreaded, and cache-friendly parallel 2-D wavelet transform algorithm based on the lifting transform is introduced. To maximize cache utilization, and consequently minimize memory bus bandwidth use, the threads compete to work on a small memory area, which maximizes the chance of finding it in the cache. Their synchronization has very low overhead: it uses no locks and relies solely on the basic compare-and-swap (CAS) atomic primitive. An implementation in the C programming language, with and without vector (single instruction, multiple data; SIMD) instructions, is provided for both single-threaded (serial) and multithreaded (parallel) single-loop DWT implementations, as well as for serial and parallel naive implementations using linear (row-order) and strided (column-order) memory access patterns for comparison. Results show a significant improvement over the single-threaded optimized implementation and a much greater improvement over both the single-threaded and multithreaded naive implementations. The algorithm reaches the minimum running time allowed by the number of processor cores and the available memory bus bandwidth, i.e., it becomes memory bound while using the minimum number of memory accesses. Given the simplicity and high speed of the lifting steps, an analysis based on the number of memory bus operations (reads and writes) is carried out for images larger than twice the shared cache size. This analysis establishes a lower bound on the running time of all linear memory access algorithms and determines the maximum speed gains to be expected relative to currently implemented parallel schemes based on the parallel execution of independent lifting steps. It also shows the optimality of the parallel algorithm presented. Finally, a comparison with currently available implementations shows the gains achieved by the proposed algorithm.
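The two ingredients named in the abstract, in-place lifting and lock-free synchronization through CAS, can be illustrated with a minimal C11 sketch. This is not the algorithm of the paper: it is a naive row-order pass in which workers claim whole rows from a shared counter using a CAS loop and apply one level of the reversible LeGall 5/3 lifting filter to each claimed row. The filter choice, the symmetric boundary extension, and the names `block_queue`, `block_queue_claim`, `lift53_forward`, and `worker_rows` are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical shared work counter: index of the next unclaimed row. */
typedef struct {
    atomic_size_t next;  /* next row to hand out */
    size_t total;        /* number of rows */
} block_queue;

static void block_queue_init(block_queue *q, size_t total)
{
    atomic_init(&q->next, 0);
    q->total = total;
}

/* Claim one row without locks. Returns 1 and stores the row index in *out,
 * or returns 0 once every row has been handed out. */
static int block_queue_claim(block_queue *q, size_t *out)
{
    size_t cur = atomic_load_explicit(&q->next, memory_order_relaxed);
    while (cur < q->total) {
        /* On failure the CAS reloads cur with the current counter value. */
        if (atomic_compare_exchange_weak_explicit(&q->next, &cur, cur + 1,
                                                  memory_order_acq_rel,
                                                  memory_order_relaxed)) {
            *out = cur;
            return 1;
        }
    }
    return 0;
}

/* Floor division for a positive divisor (C's '/' truncates toward zero). */
static int floor_div(int a, int b)
{
    int q = a / b;
    if (a % b < 0)
        q--;
    return q;
}

/* One level of the integer LeGall 5/3 forward lifting, in place.
 * Even indices end up holding low-pass samples, odd indices high-pass.
 * Assumes an even length n >= 2 and symmetric extension at both ends. */
static void lift53_forward(int *x, size_t n)
{
    assert(n >= 2 && n % 2 == 0);
    /* Predict step: odd samples become detail coefficients. */
    for (size_t i = 1; i < n; i += 2) {
        int right = (i + 1 < n) ? x[i + 1] : x[i - 1];
        x[i] -= floor_div(x[i - 1] + right, 2);
    }
    /* Update step: even samples become smoothed approximations. */
    for (size_t i = 0; i < n; i += 2) {
        int left = (i > 0) ? x[i - 1] : x[i + 1];
        x[i] += floor_div(left + x[i + 1] + 2, 4);
    }
}

/* Worker body: keep claiming rows until none are left.
 * Several threads may run this concurrently on the same queue. */
static void worker_rows(block_queue *q, int *img, size_t width)
{
    size_t row;
    while (block_queue_claim(q, &row))
        lift53_forward(img + row * width, width);
}
```

Because each row index is handed out exactly once, any number of threads can call `worker_rows` on the same queue without further coordination. A cache-friendly scheme such as the one the abstract describes would additionally keep the threads working on a small shared memory region and fold the column pass into the same loop, which this sketch does not attempt.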
