Parallel lossless data compression on the GPU

We present parallel algorithms and implementations of a bzip2-like lossless data compression scheme for GPU architectures. Our approach parallelizes the three main stages of the bzip2 compression pipeline: the Burrows-Wheeler transform (BWT), the move-to-front transform (MTF), and Huffman coding. In particular, we use a two-level hierarchical sort for BWT, design a novel scan-based parallel MTF algorithm, and implement a parallel reduction scheme to build the Huffman tree. For each algorithm, we present a detailed performance analysis, discuss its strengths and weaknesses, and suggest directions for future improvement. Overall, the running time of our GPU implementation is dominated by BWT, and the implementation is 2.78× slower than bzip2, with the BWT and MTF-Huffman stages respectively 2.89× and 1.34× slower on average.
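For orientation, the three stages named above can be sketched serially in a few lines of Python. This is only a CPU reference illustration of what each stage computes, not the GPU algorithms described here: the BWT uses a naive sort of all rotations rather than a two-level hierarchical sort, the MTF is the plain sequential list update rather than a scan-based formulation, and the Huffman step uses a binary heap rather than a parallel reduction. The function names `bwt`, `mtf`, and `huffman_code_lengths` are hypothetical, chosen for this sketch.

```python
import heapq
from collections import Counter
from itertools import count


def bwt(block: bytes) -> tuple[bytes, int]:
    """Naive BWT: sort all rotations, return the last column and the
    row index of the original string among the sorted rotations."""
    n = len(block)
    if n == 0:
        return b"", 0
    doubled = block + block  # rotation i is doubled[i:i + n]
    order = sorted(range(n), key=lambda i: doubled[i:i + n])
    last_column = bytes(doubled[i + n - 1] for i in order)
    return last_column, order.index(0)


def mtf(data: bytes) -> list[int]:
    """Sequential move-to-front: emit each byte's current position in a
    recency list, then move that byte to the front of the list."""
    symbols = list(range(256))
    out = []
    for b in data:
        idx = symbols.index(b)
        out.append(idx)
        symbols.pop(idx)
        symbols.insert(0, b)
    return out


def huffman_code_lengths(values: list[int]) -> dict[int, int]:
    """Huffman code length per symbol, built with a min-heap of subtrees."""
    freqs = Counter(values)
    if len(freqs) == 1:  # a single distinct symbol still needs one bit
        return {next(iter(freqs)): 1}
    tiebreak = count()  # keeps heap entries comparable when weights tie
    heap = [(f, next(tiebreak), [s]) for s, f in sorted(freqs.items())]
    heapq.heapify(heap)
    lengths = {s: 0 for s in freqs}
    while len(heap) > 1:
        w1, _, syms1 = heapq.heappop(heap)
        w2, _, syms2 = heapq.heappop(heap)
        for s in syms1 + syms2:  # every leaf under the merge gets one bit deeper
            lengths[s] += 1
        heapq.heappush(heap, (w1 + w2, next(tiebreak), syms1 + syms2))
    return lengths


last_column, primary_index = bwt(b"banana")
ranks = mtf(last_column)
code_lengths = huffman_code_lengths(ranks)
```

On the input `banana`, the BWT groups equal bytes together (`nnbaaa`), the MTF turns those runs into small ranks, mostly zeros, and the Huffman stage then assigns the shortest code to the most frequent rank. That skew toward small values is what makes the three stages effective in combination.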
