Video Super-Resolution Transformer

Video super-resolution (VSR), which aims to restore a high-resolution video from its low-resolution counterpart, is a spatial-temporal sequence prediction problem. Recently, Transformer has been gaining popularity due to its parallel computing ability for sequence-to-sequence modeling. It therefore seems straightforward to apply the vision Transformer to VSR. However, the typical Transformer block design, with a fully connected self-attention layer and a token-wise feed-forward layer, does not fit VSR well for two reasons. First, the fully connected self-attention layer fails to exploit data locality because it relies on linear layers to compute attention maps. Second, the token-wise feed-forward layer lacks feature alignment, which is important for VSR, because it processes each input token embedding independently, without any interaction among them. In this paper, we make the first attempt to adapt Transformer for VSR. Specifically, to tackle the first issue, we present a spatial-temporal convolutional self-attention layer, accompanied by a theoretical understanding, to exploit locality information. For the second issue, we design a bidirectional optical flow-based feed-forward layer to discover correlations across different video frames and to align features. Extensive experiments on several benchmark datasets demonstrate the effectiveness of the proposed method. The code will be available at https://github.com/caojiezhang/VSR-Transformer.
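The second limitation can be illustrated with a toy example. The sketch below is not the authors' implementation: the helper names (`feed_forward`, `self_attention`, `random_matrix`), the random weights, and the tiny sizes (3 tokens of dimension 4) are all assumptions made for illustration. It builds a standard token-wise feed-forward layer and a fully connected self-attention layer in plain Python, perturbs one token, and checks which outputs change. The feed-forward output for an untouched token stays exactly the same, which shows that this layer alone cannot relate or align features across tokens (for example, across frames).

```python
import math
import random


def matvec(W, x):
    # Multiply matrix W (list of rows) by vector x.
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def relu(x):
    return [max(0.0, v) for v in x]


def feed_forward(tokens, W1, W2):
    # Token-wise feed-forward: every token is transformed on its own,
    # so no information flows between tokens.
    return [matvec(W2, relu(matvec(W1, t))) for t in tokens]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens, Wq, Wk, Wv):
    # Fully connected self-attention: queries, keys and values are
    # linear maps of each token; the attention map mixes all tokens.
    d = len(Wq)
    q = [matvec(Wq, t) for t in tokens]
    k = [matvec(Wk, t) for t in tokens]
    v = [matvec(Wv, t) for t in tokens]
    out = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        weights = softmax(scores)
        out.append([sum(w * vj[c] for w, vj in zip(weights, v))
                    for c in range(len(v[0]))])
    return out


def random_matrix(rows, cols, rng):
    # Hypothetical helper: random weights stand in for trained ones.
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
dim, n_tokens = 4, 3  # toy sizes, chosen only for illustration
tokens = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n_tokens)]
W1, W2 = random_matrix(8, dim, rng), random_matrix(dim, 8, rng)
Wq, Wk, Wv = (random_matrix(dim, dim, rng) for _ in range(3))

# Perturb only the last token and compare the outputs for the first token.
perturbed = [list(t) for t in tokens]
perturbed[2] = [v + 1.0 for v in perturbed[2]]

ff_a, ff_b = feed_forward(tokens, W1, W2), feed_forward(perturbed, W1, W2)
at_a, at_b = (self_attention(tokens, Wq, Wk, Wv),
              self_attention(perturbed, Wq, Wk, Wv))

ff_token0_unchanged = ff_a[0] == ff_b[0]   # no interaction among tokens
attn_token0_changed = at_a[0] != at_b[0]   # attention mixes tokens

print("feed-forward output of token 0 unchanged:", ff_token0_unchanged)
print("attention output of token 0 changed:", attn_token0_changed)
```

In this sketch the attention layer does mix tokens, but only through per-token linear projections, which is the property the first limitation refers to. The convolutional self-attention and optical flow-based feed-forward layers proposed in the paper are not reproduced here.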
