Massively Parallel ANS Decoding on GPUs

In recent years, graphics processors have enabled significant advances in the fields of big data and streamed deep learning. Compression is a key part of many applications, including popular deep learning pipelines, because it keeps rapidly growing amounts of data under control and helps sustain sufficient throughput. However, as most of the respective APIs rely on CPU-based preprocessing for decoding, data decompression frequently becomes a bottleneck in accelerated compute systems. This establishes the need for efficient GPU-based decompression. Asymmetric numeral systems (ANS) are a modern approach to entropy coding, combining superior compression results with high compression and decompression speeds. Concepts for parallelizing ANS decompression on GPUs have been published recently, but they exhibit only limited scalability in practical applications. In this paper, we present the first massively parallel, arbitrarily scalable approach to ANS decoding on GPUs, based on a novel overflow pattern. Our performance evaluation on three different CUDA-enabled GPUs (V100, TITAN V, GTX 1080) demonstrates speedups of up to 17x over 64 CPU threads, up to 31x over a high-performance SIMD-based solution, and up to 39x over Zstandard's entropy codec. Our implementation is publicly available at https://github.com/weissenberger/multians.
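
For readers unfamiliar with ANS, the following is a minimal sketch of the range variant (rANS) in Python. It is an illustration only and not the implementation described in this paper: it keeps the coder state in a single unbounded integer, omits the renormalization step that practical codecs use to stream bits, and does not show any parallel decoding. The function names (`build_tables`, `encode`, `decode`) are hypothetical and chosen for this example.

```python
from collections import Counter


def build_tables(data):
    """Return (freq, cum, total): symbol counts, cumulative starts, and their sum."""
    freq = Counter(data)
    cum = {}
    running = 0
    for sym in sorted(freq):
        cum[sym] = running  # first slot owned by this symbol
        running += freq[sym]
    return freq, cum, running


def encode(data, freq, cum, total):
    """Fold all symbols into one integer state (no renormalization)."""
    state = 1
    # ANS behaves like a stack, so symbols are encoded in reverse
    # to let the decoder emit them in forward order.
    for sym in reversed(data):
        f = freq[sym]
        state = (state // f) * total + cum[sym] + (state % f)
    return state


def decode(state, count, freq, cum, total):
    """Recover `count` symbols from the integer state."""
    # Map every slot in [0, total) back to the symbol that owns it.
    slot_to_sym = [None] * total
    for sym, f in freq.items():
        for slot in range(cum[sym], cum[sym] + f):
            slot_to_sym[slot] = sym

    out = []
    for _ in range(count):
        slot = state % total
        sym = slot_to_sym[slot]
        # Exact inverse of the encoder's state update.
        state = freq[sym] * (state // total) + slot - cum[sym]
        out.append(sym)
    return out


if __name__ == "__main__":
    message = b"abracadabra"
    freq, cum, total = build_tables(message)
    state = encode(message, freq, cum, total)
    assert bytes(decode(state, len(message), freq, cum, total)) == message
```

Each decoding step depends on the state left by the previous one, which is the sequential dependency that makes parallel ANS decoding non-trivial.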
