Parallel design of JPEG-LS encoder on graphics processing units

Abstract. With recent technical advances in graphics processing units (GPUs), GPUs have outperformed CPUs in terms of compute capability and memory bandwidth, and many successful GPU applications to high-performance computing have been reported. JPEG-LS is an ISO/IEC standard for lossless image compression that uses adaptive context modeling and run-length coding to improve the compression ratio. However, adaptive context modeling introduces data dependencies among adjacent pixels, and run-length coding must be performed sequentially. Hence, compressing large-volume hyperspectral image data with JPEG-LS is quite time-consuming. We implement an efficient parallel JPEG-LS encoder for lossless hyperspectral compression on an NVIDIA GPU using the compute unified device architecture (CUDA) programming technology. We use a block-parallel strategy together with CUDA techniques such as coalesced global memory access, parallel prefix sum, and asynchronous data transfer. We also show how GPU speedup and compression ratio each vary with the block size used to partition Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) images. When AVIRIS images are divided into blocks of 64 × 64 pixels, we obtain the best GPU performance, a 26.3x speedup over the original CPU code.
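The block-parallel strategy and the role of the prefix sum can be illustrated with a minimal sequential sketch. It is not the paper's CUDA implementation: `split_into_blocks`, `encode_block`, `exclusive_scan`, and `encode_image` are hypothetical names, samples are assumed to be 8-bit, and `zlib.compress` is only a stand-in for a per-block JPEG-LS encoder. The point is that each block is encoded independently, so blocks could be processed concurrently, and that an exclusive prefix sum over the variable encoded sizes gives each block its write offset in the output stream.

```python
import zlib
from itertools import accumulate


def split_into_blocks(image, block):
    """Yield (row, col, tile) for block x block tiles of a 2D list of 8-bit samples."""
    height, width = len(image), len(image[0])
    for r in range(0, height, block):
        for c in range(0, width, block):
            yield r, c, [row[c:c + block] for row in image[r:r + block]]


def encode_block(tile):
    """Stand-in encoder (assumption): zlib here, where a real design would run JPEG-LS.

    Each call depends only on its own tile, so there is no context
    dependency across block boundaries.
    """
    raw = bytes(v for row in tile for v in row)
    return zlib.compress(raw)


def exclusive_scan(sizes):
    """Exclusive prefix sum: offsets[i] = sizes[0] + ... + sizes[i-1]."""
    return list(accumulate(sizes, initial=0))[:-1]


def encode_image(image, block=64):
    """Encode every block independently, then pack the results contiguously."""
    payloads = [encode_block(tile) for _, _, tile in split_into_blocks(image, block)]
    sizes = [len(p) for p in payloads]
    offsets = exclusive_scan(sizes)  # where each block's bytes start in the stream
    stream = bytearray(sum(sizes))
    for off, payload in zip(offsets, payloads):
        stream[off:off + len(payload)] = payload
    return bytes(stream), offsets, sizes
```

On a GPU the per-block encoding and the scan would each run in parallel; here both are sequential for clarity. Smaller blocks expose more parallelism but give each encoder less data to adapt to, which is consistent with the trade-off between speedup and compression ratio that the abstract reports.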
