STAR: A Structure-aware Lightweight Transformer for Real-time Image Enhancement

Image and video enhancement on smartphones, such as color constancy, low-light enhancement, and tone mapping, is challenging because high-quality images must be produced efficiently within a limited resource budget. Unlike prior works that used either very deep CNNs or large Transformer models, we propose a structure-aware lightweight Transformer, termed STAR, for real-time image enhancement. STAR is formulated to capture long-range dependencies between image patches, which naturally and implicitly captures the structural relationships among different regions of an image. STAR is a general architecture that can be easily adapted to different image enhancement tasks. Extensive experiments show that STAR can effectively boost the quality and efficiency of many tasks such as illumination enhancement, auto white balance, and photo retouching, which are indispensable components of image processing on smartphones. For example, STAR reduces model complexity and improves image quality compared to the recent state of the art [19] on the MIT-Adobe FiveK dataset [7] (i.e., a 1.8 dB PSNR improvement with 25% of the parameters and 13% of the floating-point operations).
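As a rough illustration of what "long-range dependencies between image patches" means, the sketch below splits an image into non-overlapping patches and applies one generic single-head self-attention step over them. It is not the STAR architecture: the patch size, embedding width, random projection matrices, and the helper names `image_to_patches` and `self_attention` are all assumptions chosen for the example.

```python
import numpy as np

def image_to_patches(img, p):
    """Split an (H, W, C) image into flattened non-overlapping p x p patches."""
    h, w, c = img.shape
    assert h % p == 0 and w % p == 0, "image size must be divisible by patch size"
    x = img.reshape(h // p, p, w // p, p, c)
    x = x.transpose(0, 2, 1, 3, 4)          # (rows, cols, p, p, C)
    return x.reshape((h // p) * (w // p), p * p * c)

def self_attention(tokens, wq, wk, wv):
    """Generic scaled dot-product self-attention over patch tokens."""
    q, k, v = tokens @ wq, tokens @ wk, tokens @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True) # softmax over all patches
    return weights @ v, weights

# Hypothetical sizes: a 32x32 RGB image, 8x8 patches, 16-dim embeddings.
rng = np.random.default_rng(0)
img = rng.random((32, 32, 3))
tokens = image_to_patches(img, 8)                  # 16 patches, 192 values each
d = 16
wq, wk, wv = (rng.standard_normal((tokens.shape[1], d)) * 0.05 for _ in range(3))
out, weights = self_attention(tokens, wq, wk, wv)
```

Each row of `weights` spans every patch in the image, so a patch in one corner can draw on information from any other region in a single step. A convolution with a small kernel would need many stacked layers to reach the same spatial extent.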

[1]  Francis E. H. Tay, et al.  Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet, 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[2]  Matthieu Cord, et al.  Training data-efficient image transformers & distillation through attention, 2020, ICML.

[3]  B. Ommer, et al.  Taming Transformers for High-Resolution Image Synthesis, 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[4]  Wen Gao, et al.  Pre-Trained Image Processing Transformer, 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[5]  S. Gelly, et al.  An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, 2020, ICLR.

[6]  Lei Zhang, et al.  Learning Image-Adaptive 3D Lookup Tables for High Performance Photo Enhancement in Real-Time, 2020, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[7]  Shuicheng Yan, et al.  ConvBERT: Improving BERT with Span-based Dynamic Convolution, 2020, NeurIPS.

[8]  Yair Movshovitz-Attias, et al.  Sky Optimization: Semantically aware image processing of skies in low-light photography, 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[9]  Mark Chen, et al.  Language Models are Few-Shot Learners, 2020, NeurIPS.

[10]  Yue Wang, et al.  VD-BERT: A Unified Vision and Dialog Transformer with BERT, 2020, EMNLP.

[11]  Song Han, et al.  Lite Transformer with Long-Short Range Attention, 2020, ICLR.

[12]  Michael S. Brown, et al.  Deep White-Balance Editing, 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[13]  Chen Change Loy, et al.  Zero-Reference Deep Curve Estimation for Low-Light Image Enhancement, 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[14]  Toshihiko Yamasaki, et al.  Unpaired Image Enhancement Featuring Reinforcement-Learning-Controlled Image Editing Software, 2019, AAAI.

[15]  Luc Van Gool, et al.  Self-Guided Network for Fast Image Denoising, 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[16]  Furu Wei, et al.  VL-BERT: Pre-training of Generic Visual-Linguistic Representations, 2019, ICLR.

[17]  Lei Zhang, et al.  CameraNet: A Two-Stage Framework for Effective Camera ISP Learning, 2019, IEEE Transactions on Image Processing.

[18]  Ding Liu, et al.  EnlightenGAN: Deep Light Enhancement Without Paired Supervision, 2019, IEEE Transactions on Image Processing.

[19]  Chi-Wing Fu, et al.  Underexposed Photo Enhancement Using Deep Illumination Estimation, 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[20]  Michael S. Brown, et al.  When Color Constancy Goes Wrong: Correcting Improperly White-Balanced Images, 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[21]  Gabriele Facciolo, et al.  Joint Demosaicking and Denoising by Fine-Tuning of Bursts of Raw Images, 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[22]  Chen Wei, et al.  Deep Retinex Decomposition for Low-Light Enhancement, 2018, BMVC.

[23]  Yung-Yu Chuang, et al.  Deep Photo Enhancer: Unpaired Learning for Image Enhancement from Photographs with GANs, 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[24]  Wei Zhang, et al.  Boosting up Scene Text Detectors with Guided CNN, 2018, BMVC.

[25]  In-So Kweon, et al.  Distort-and-Recover: Color Enhancement Using Deep Reinforcement Learning, 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[26]  Mahmoud Afifi, et al.  Semantic White Balance: Semantic Color Constancy Using Convolutional Neural Network, 2018, ArXiv.

[27]  Raja Giryes, et al.  DeepISP: Toward Learning an End-to-End Image Processing Pipeline, 2018, IEEE Transactions on Image Processing.

[28]  Sven Loncaric, et al.  Unsupervised Learning for Color Constancy, 2017, VISIGRAPP.

[29]  Hao He, et al.  Exposure, 2017, ACM Trans. Graph.

[30]  Gang Sun, et al.  Squeeze-and-Excitation Networks, 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[31]  Seonghyeon Nam, et al.  Modelling the Scene Dependent Imaging in Cameras with a Deep Neural Network, 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[32]  Ravi Ramamoorthi, et al.  Deep high dynamic range imaging of dynamic scenes, 2017, ACM Trans. Graph.

[33]  Jonathan T. Barron, et al.  Deep bilateral learning for real-time image enhancement, 2017, ACM Trans. Graph.

[34]  Lukasz Kaiser, et al.  Attention is All you Need, 2017, NIPS.

[35]  Yu Li, et al.  LIME: Low-Light Image Enhancement via Illumination Map Estimation, 2017, IEEE Transactions on Image Processing.

[36]  Jonathan T. Barron, et al.  Burst photography for high dynamic range and low-light imaging on mobile cameras, 2016, ACM Trans. Graph.

[37]  Xiao-Ping Zhang, et al.  A Weighted Variational Model for Simultaneous Reflectance and Illumination Estimation, 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[38]  Kevin Gimpel, et al.  Gaussian Error Linear Units (GELUs), 2016, 1606.08415.

[39]  Yizhou Yu, et al.  Automatic Photo Adjustment Using Deep Neural Networks, 2014, ACM Trans. Graph.

[40]  Jimmy Ba, et al.  Adam: A Method for Stochastic Optimization, 2014, ICLR.

[41]  Eli Shechtman, et al.  Patch-based high dynamic range video, 2013, ACM Trans. Graph.

[42]  Eli Shechtman, et al.  Robust patch-based hdr reconstruction of dynamic scenes, 2012, ACM Trans. Graph.

[43]  Fei-Fei Li, et al.  ImageNet: A large-scale hierarchical image database, 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[44]  William T. Freeman, et al.  The patch transform and its applications to image editing, 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[45]  Alexei A. Efros, et al.  Fast bilateral filtering for the display of high-dynamic-range images, 2002.

[46]  Shiyu Chang, et al.  TransGAN: Two Transformers Can Make One Strong GAN, 2021, ArXiv.

[47]  Ming-Wei Chang, et al.  BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.

[48]  Ilya Sutskever, et al.  Language Models are Unsupervised Multitask Learners, 2019.

[49]  Wangmeng Zuo, et al.  Color Image Demosaicking via Deep Residual Learning, 2017.