Paper Information - Transformer-based Image Compression

Transformer-based Image Compression

A Transformer-based Image Compression (TIC) approach is developed which reuses the canonical variational autoencoder (VAE) architecture with paired main and hyper encoder-decoders. Both the main and hyper encoders comprise a sequence of neural transformation units (NTUs) that analyse and aggregate important information for a more compact representation of the input image, while the decoders mirror the encoder-side operations to generate the pixel-domain image reconstruction from the compressed bitstream. Each NTU consists of a Swin Transformer Block (STB) and a convolutional layer (Conv) to best embed both long-range and short-range information. Meanwhile, a causal attention module (CAM) is devised for adaptive context modeling of latent features, utilizing both hyper and autoregressive priors. TIC rivals state-of-the-art approaches, including deep convolutional neural network (CNN) based learnt image coding (LIC) methods and the handcrafted, rule-based intra profile of the recently approved Versatile Video Coding (VVC) standard, while requiring far fewer model parameters, e.g., up to 45% fewer than the leading-performance LIC method.
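The causal constraint behind autoregressive context modeling can be illustrated with a small sketch. It is not the paper's CAM implementation: it assumes raster-scan decoding order and a k x k neighbourhood window, and the names `causal_mask` and `masked_attention_weights` are hypothetical. The only point it makes is that attention weights are restricted to latent positions that have already been decoded.

```python
import math


def causal_mask(k):
    """Return a k x k mask for a raster-scan causal context (assumed order).

    An entry is 1 if that position is decoded strictly before the centre
    of the window, and 0 otherwise (the centre itself is masked out).
    """
    if k < 1 or k % 2 == 0:
        raise ValueError("k must be a positive odd integer")
    c = k // 2
    return [
        [1 if (r < c or (r == c and col < c)) else 0 for col in range(k)]
        for r in range(k)
    ]


def masked_attention_weights(scores, mask):
    """Softmax over a k x k grid of scores, limited to positions where mask is 1.

    Returns a flat list of k*k weights; masked positions get weight 0.0.
    """
    flat_scores = [s for row in scores for s in row]
    flat_mask = [m for row in mask for m in row]
    visible = [s for s, m in zip(flat_scores, flat_mask) if m]
    if not visible:
        raise ValueError("mask hides every position")
    peak = max(visible)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) if m else 0.0
            for s, m in zip(flat_scores, flat_mask)]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    mask = causal_mask(3)
    weights = masked_attention_weights([[0.0] * 3 for _ in range(3)], mask)
    print(mask)
    print(weights)
```

With a 3 x 3 window, only the top row and the left neighbour of the centre are visible, so uniform scores yield four equal weights. In a real model the scores would come from query-key products over latent features, and the hyper prior would be combined with this context to predict entropy parameters.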
