HIPA: Hierarchical Patch Transformer for Single Image Super Resolution

Transformer-based architectures have begun to emerge in single image super resolution (SISR) and have achieved promising performance. However, most existing vision Transformer-based SISR methods still have two shortcomings: (1) they divide images into the same number of patches with a fixed size, which may not be optimal for restoring patches with different levels of texture richness; and (2) their position encodings treat all input tokens equally and hence neglect the dependencies among them. This paper presents HIPA, a novel Transformer architecture that progressively recovers the high resolution image using a hierarchical patch partition. Specifically, we build a cascaded model that processes an input image in multiple stages, starting with tokens of small patch size and gradually merging them to form the full resolution. Such a hierarchical patch mechanism not only explicitly enables feature aggregation at multiple resolutions but also adaptively learns patch-aware features for different image regions, e.g., using a smaller patch for areas with fine details and a larger patch for textureless regions. Meanwhile, we propose a new attention-based position encoding scheme for Transformers that assigns different weights to different tokens, letting the network focus on the tokens that deserve more attention; to the best of our knowledge, this is the first such scheme. Furthermore, we propose a multi-receptive field attention module that enlarges the convolution receptive field through different branches. Experimental results on several public datasets demonstrate the superior performance of the proposed HIPA over previous methods, both quantitatively and qualitatively. We will share our code and models when the paper is accepted.
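The idea of tokenizing one image at several patch sizes can be illustrated with a minimal sketch. This is not the HIPA implementation: the function names, the NumPy representation, and the patch sizes (4, 8, 16) are assumptions chosen only to show how the token count shrinks and the per-token dimension grows as patches get larger.

```python
import numpy as np


def partition_into_patches(image, patch_size):
    """Split an (H, W, C) array into non-overlapping patch tokens.

    Returns an array of shape (num_patches, patch_size * patch_size * C).
    Hypothetical helper for illustration; not taken from the paper.
    """
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    rows, cols = h // patch_size, w // patch_size
    # Expose the patch grid as separate axes, then group each patch's pixels.
    patches = image.reshape(rows, patch_size, cols, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(rows * cols, patch_size * patch_size * c)


def hierarchical_token_shapes(image, patch_sizes=(4, 8, 16)):
    """Token matrix shape at each stage, from small to large patches.

    The patch sizes are assumed values, not the ones used by HIPA.
    """
    return [partition_into_patches(image, p).shape for p in patch_sizes]


image = np.arange(32 * 32 * 3, dtype=np.float32).reshape(32, 32, 3)
for size, shape in zip((4, 8, 16), hierarchical_token_shapes(image)):
    print(f"patch size {size}: tokens {shape[0]}, token dim {shape[1]}")
```

In this sketch each stage re-partitions the same array; a cascaded model as described above would instead merge the features of neighbouring small-patch tokens to form the larger ones.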
