Flow-Guided Sparse Transformer for Video Deblurring

Exploiting similar and sharper scene patches in spatio-temporal neighborhoods is critical for video deblurring. However, CNN-based methods show limitations in capturing long-range dependencies and modeling non-local self-similarity. In this paper, we propose a novel framework, Flow-Guided Sparse Transformer (FGST), for video deblurring. In FGST, we customize a self-attention module, Flow-Guided Sparse Window-based Multi-head Self-Attention (FGSW-MSA). For each query element on the blurry reference frame, FGSW-MSA uses the estimated optical flow as guidance to globally sample spatially sparse yet highly related key elements corresponding to the same scene patch in neighboring frames. In addition, we present a Recurrent Embedding (RE) mechanism that transfers information from past frames and strengthens long-range temporal dependencies. Comprehensive experiments demonstrate that our proposed FGST outperforms state-of-the-art (SOTA) methods on both the DVD and GOPRO datasets and yields visually more pleasing results on real-world blurry videos. Code and models will be released to the public.
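To make the flow-guided sampling idea concrete, below is a minimal PyTorch sketch of flow-guided cross-frame attention. It is not the released FGST code: the actual FGSW-MSA samples a spatially sparse set of key elements around flow-shifted window positions, whereas this sketch simply warps each neighboring frame's features toward the reference frame with the estimated flow (via grid_sample) and lets every reference-frame query attend over its temporally aligned counterparts. All module and argument names (FlowGuidedAttention, flow_warp, num_heads, ...) are illustrative assumptions, not the paper's API.

```python
# Minimal sketch of flow-guided cross-frame attention (NOT the authors' FGSW-MSA).
# Assumption: neighbor features are aligned to the reference frame by warping
# with the estimated optical flow, then attended to per query pixel.
import torch
import torch.nn as nn
import torch.nn.functional as F


def flow_warp(feat, flow):
    """Warp a neighbor feature map (B, C, H, W) toward the reference frame
    using a per-pixel flow field (B, 2, H, W) given in pixel units (dx, dy)."""
    b, _, h, w = feat.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(feat.device)   # (2, H, W)
    coords = grid.unsqueeze(0) + flow                              # (B, 2, H, W)
    # Normalize coordinates to [-1, 1] as required by grid_sample.
    coords_x = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    norm_grid = torch.stack((coords_x, coords_y), dim=-1)          # (B, H, W, 2)
    return F.grid_sample(feat, norm_grid, align_corners=True)


class FlowGuidedAttention(nn.Module):
    """Queries come from the blurry reference frame; keys/values come from
    flow-aligned neighboring frames, so each query attends to features that
    approximately depict the same scene patch."""

    def __init__(self, dim, num_heads=4):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.to_q = nn.Conv2d(dim, dim, 1)
        self.to_kv = nn.Conv2d(dim, 2 * dim, 1)
        self.proj = nn.Conv2d(dim, dim, 1)

    def forward(self, ref_feat, neighbor_feats, flows):
        """ref_feat: (B, C, H, W); neighbor_feats, flows: lists of
        (B, C, H, W) and (B, 2, H, W) tensors, one per neighboring frame."""
        b, c, h, w = ref_feat.shape
        q = self.to_q(ref_feat)
        # Align each neighbor to the reference frame, then collect keys/values.
        kvs = [self.to_kv(flow_warp(f, fl)) for f, fl in zip(neighbor_feats, flows)]
        k = torch.stack([kv[:, :c] for kv in kvs], dim=2)   # (B, C, T, H, W)
        v = torch.stack([kv[:, c:] for kv in kvs], dim=2)   # (B, C, T, H, W)

        # Per-pixel multi-head attention over the T aligned neighbors.
        hd = c // self.num_heads
        q = q.view(b, self.num_heads, hd, 1, h * w)
        k = k.view(b, self.num_heads, hd, -1, h * w)
        v = v.view(b, self.num_heads, hd, -1, h * w)
        attn = (q * k).sum(dim=2, keepdim=True) * self.scale   # (B, heads, 1, T, HW)
        attn = attn.softmax(dim=3)
        out = (attn * v).sum(dim=3)                             # (B, heads, hd, HW)
        return self.proj(out.reshape(b, c, h, w)) + ref_feat
```

In practice, ref_feat and neighbor_feats would be shallow feature embeddings of consecutive blurry frames, and the flows would come from an off-the-shelf estimator (e.g., a pre-trained optical-flow network); the Recurrent Embedding mechanism described above would additionally inject features propagated from past frames, which this sketch omits.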
