ReBotNet: Fast Real-time Video Enhancement

Most video restoration networks are slow and computationally expensive, which makes them unsuitable for real-time video enhancement. In this work, we design an efficient and fast framework to perform real-time video enhancement for practical use cases such as live video calls and video streams. Our proposed method, called Recurrent Bottleneck Mixer Network (ReBotNet), employs a dual-branch framework. The first branch learns spatio-temporal features by tokenizing the input frames along the spatial and temporal dimensions using a ConvNeXt-based encoder and processing these abstract tokens using a bottleneck mixer. To further improve temporal consistency, the second branch employs a mixer directly on tokens extracted from individual frames. A common decoder then merges the features from the two branches to predict the enhanced frame. In addition, we propose a recurrent training approach in which the previous frame's prediction is leveraged to efficiently enhance the current frame while improving temporal consistency. To evaluate our method, we curate two new datasets that emulate real-world video call and streaming scenarios, and show extensive results on multiple datasets where ReBotNet outperforms existing approaches with lower computational cost, reduced memory requirements, and faster inference time.
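The recurrent idea described above, where the prediction for the previous frame is fed back in when enhancing the current frame, can be illustrated with a minimal sketch. This is not the authors' implementation: `enhance_stream`, `blend_enhancer`, and the `Frame` alias are hypothetical names, the averaging function merely stands in for the trained network, and bootstrapping the first frame with the degraded input itself is an assumption.

```python
from typing import Callable, List, Optional, Sequence

# A frame is modelled as a flat list of pixel values purely for illustration;
# a real system would use image tensors.
Frame = List[float]


def blend_enhancer(current: Frame, previous_prediction: Frame) -> Frame:
    """Hypothetical stand-in for the network: average the current frame
    with the previous prediction."""
    return [0.5 * c + 0.5 * p for c, p in zip(current, previous_prediction)]


def enhance_stream(
    frames: Sequence[Frame],
    enhance: Callable[[Frame, Frame], Frame],
) -> List[Frame]:
    """Enhance frames one at a time, feeding each prediction back in
    as extra input for the next frame."""
    outputs: List[Frame] = []
    previous: Optional[Frame] = None
    for frame in frames:
        if previous is None:
            # Assumption: no prediction exists yet, so reuse the input frame.
            previous = frame
        prediction = enhance(frame, previous)
        outputs.append(prediction)
        previous = prediction  # carried over to the next time step
    return outputs


if __name__ == "__main__":
    stream = [[0.0, 4.0], [2.0, 0.0]]
    print(enhance_stream(stream, blend_enhancer))
```

Because each step needs only the current frame and one stored prediction, a loop of this shape processes a live stream without buffering future frames, which is the property that matters for video calls.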
