Towards End-to-End Image Compression and Analysis with Transformers

We propose an end-to-end image compression and analysis model with Transformers, targeting cloud-based image classification applications. Instead of placing an existing Transformer-based image classification model directly after an image codec, we redesign the Vision Transformer (ViT) model so that it performs image classification from the compressed features and, in turn, facilitates image compression with the long-term information captured by the Transformer. Specifically, we first replace the patchify stem (i.e., image splitting and embedding) of the ViT model with a lightweight image encoder modelled by a convolutional neural network. The compressed features generated by the image encoder therefore carry a convolutional inductive bias, and they are fed to the Transformer for image classification, bypassing image reconstruction. Meanwhile, we propose a feature aggregation module that fuses the compressed features with selected intermediate features of the Transformer and feeds the aggregated features to a deconvolutional neural network for image reconstruction. The aggregated features obtain long-term information from the self-attention mechanism of the Transformer, which improves the compression performance. Finally, the rate-distortion-accuracy optimization problem is solved with a two-step training strategy. Experimental results demonstrate the effectiveness of the proposed model in both the image compression and the image classification tasks.
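The abstract does not spell out the rate-distortion-accuracy objective, so the following is only a minimal sketch of what such a joint loss could look like. It assumes an additive form, rate + lambda_d * distortion + lambda_a * classification loss, with mean squared error as the distortion and cross-entropy as the accuracy term. The function names and the weights `lambda_d` and `lambda_a` are hypothetical and are not taken from the paper.

```python
import math


def rate_bpp(likelihoods, num_pixels):
    """Estimated rate in bits per pixel, given the probabilities that an
    entropy model assigns to the quantized latent symbols."""
    return sum(-math.log2(p) for p in likelihoods) / num_pixels


def mse(original, reconstructed):
    """Mean squared error between two flat lists of pixel values."""
    return sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)


def cross_entropy(logits, label):
    """Cross-entropy (in nats) of one sample, using a stable log-sum-exp."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def rda_loss(likelihoods, original, reconstructed, logits, label,
             lambda_d=1.0, lambda_a=1.0):
    """Assumed joint objective: R + lambda_d * D + lambda_a * A.

    lambda_d and lambda_a are hypothetical trade-off weights.
    """
    r = rate_bpp(likelihoods, num_pixels=len(original))
    d = mse(original, reconstructed)
    a = cross_entropy(logits, label)
    return r + lambda_d * d + lambda_a * a


if __name__ == "__main__":
    loss = rda_loss(
        likelihoods=[0.5, 0.25],          # two latent symbols
        original=[0.0, 0.0, 0.0, 0.0],    # four "pixels"
        reconstructed=[0.0, 0.0, 0.0, 0.0],
        logits=[0.0, 0.0],                # two-class classifier output
        label=0,
    )
    print(loss)
```

Under this assumed form, a two-step training strategy could, for example, optimize a subset of the terms first and the full sum afterwards; which terms and which parameters are trained in each step is described in the paper itself, not in this sketch.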
