Revisiting Huffman Coding: Toward Extreme Performance on Modern GPU Architectures

Today's high-performance computing (HPC) applications produce vast volumes of data that are challenging to store and transfer efficiently during execution, so data compression is becoming a critical technique for mitigating the storage burden and data-movement cost. Huffman coding is arguably the most efficient entropy coding algorithm in information theory, and it serves as a fundamental step in many modern compression algorithms such as DEFLATE. Meanwhile, today's HPC applications increasingly rely on accelerators such as GPUs on supercomputers, yet Huffman encoding suffers from low throughput on GPUs, creating a significant bottleneck in the overall data-processing pipeline. In this paper, we propose and implement an efficient Huffman encoding approach for modern GPU architectures that addresses two key challenges: (1) how to parallelize the entire Huffman encoding algorithm, including codebook construction, and (2) how to fully utilize the high memory bandwidth of modern GPU architectures. Our contribution is fourfold. (1) We develop an efficient parallel codebook construction on GPUs that scales effectively with the number of input symbols. (2) We propose a novel reduction-based encoding scheme that can efficiently merge codewords on GPUs. (3) We optimize overall GPU performance by leveraging state-of-the-art CUDA APIs such as Cooperative Groups. (4) We evaluate our Huffman encoder thoroughly using six real-world application datasets on two advanced GPUs and compare it with our multi-threaded CPU Huffman encoder. Experiments show that our solution improves the encoding throughput by up to 5.0x on an NVIDIA RTX 5000 and up to 6.8x on a V100 over the state-of-the-art GPU Huffman encoder, and by up to 3.3x over the multi-threaded encoder on two 28-core Xeon Platinum 8280 CPUs.
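To make the reduction-based encoding idea of contribution (2) concrete, the sketch below shows one plausible way to merge variable-length codewords on a GPU. It is a minimal illustration, not the paper's implementation: the `Fragment` type, the `merge` and `merge_codewords` names, the 64-bit packing limit, and the block-level shared-memory tree reduction are all our assumptions; a full encoder must spill completed words to the output bitstream so merged fragments of arbitrary length can be handled.

```cuda
#include <cstdint>

// Hypothetical packed representation of a (possibly merged) codeword:
// `bits` holds the code right-aligned, `len` is the number of valid bits.
struct Fragment {
    uint64_t bits;
    uint32_t len;
};

// Concatenate fragment `b` after fragment `a`. Assumes a.len + b.len <= 64;
// a real implementation would flush completed 64-bit words to the output.
__device__ Fragment merge(Fragment a, Fragment b) {
    return Fragment{(a.bits << b.len) | b.bits, a.len + b.len};
}

// Each block packs BLOCK consecutive codewords into one fragment with an
// order-preserving tree reduction: adjacent pairs are merged at every step,
// so the bit order of the input symbols is preserved.
template <int BLOCK>
__global__ void merge_codewords(const Fragment* codewords, Fragment* out, int n) {
    __shared__ Fragment s[BLOCK];
    int gid = blockIdx.x * BLOCK + threadIdx.x;
    s[threadIdx.x] = (gid < n) ? codewords[gid] : Fragment{0u, 0u};
    __syncthreads();

    for (int active = BLOCK / 2; active > 0; active >>= 1) {
        Fragment m{0u, 0u};
        if (threadIdx.x < active)
            m = merge(s[2 * threadIdx.x], s[2 * threadIdx.x + 1]);
        __syncthreads();              // finish all reads before overwriting
        if (threadIdx.x < active)
            s[threadIdx.x] = m;
        __syncthreads();
    }
    if (threadIdx.x == 0)
        out[blockIdx.x] = s[0];       // one packed fragment per block
}

// Example launch (BLOCK = 8 keeps merged lengths safely within 64 bits):
//   merge_codewords<8><<<(n + 7) / 8, 8>>>(d_codewords, d_fragments, n);
```

Because concatenation is associative, the pairwise tree reduction yields the same bitstream as a sequential append while exposing log-depth parallelism, which is what makes the merge step amenable to high-bandwidth GPU execution.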
