Video super-resolution (VSR) aims to restore a low-resolution (LR) video to a higher-resolution (HR) one. Because of the temporal nature of video, it is crucial that motion information among frames be carefully estimated, aggregated and exploited to guide a VSR algorithm. In particular, when a video contains large motion, conventional methods easily produce incoherent results or artifacts. In this paper, we propose a novel deep neural network with Dual Subnet and Multi-stage Communicated Upsampling (DSMC) for super-resolution of videos with large motion. We design a new module, a U-shaped residual dense network with 3D convolution (U3D-RDN), for fine implicit motion estimation and motion compensation (MEMC) as well as coarse spatial feature extraction. We also present a new Multi-Stage Communicated Upsampling (MSCU) module that makes full use of the intermediate results of upsampling to guide the VSR. Moreover, a novel dual subnet is devised to aid the training of DSMC; its dual loss helps to reduce the solution space as well as enhance the generalization ability. Our experimental results confirm that our method achieves superior performance on videos with large motion compared to state-of-the-art methods.

Introduction

Video super-resolution (VSR) aims at recovering the high-resolution (HR) counterpart of a given low-resolution (LR) video (Liu et al. 2020b). As an important computer vision task, it is a classic ill-posed problem. In recent years, owing to the emergence of 5G technology and the popularity of high-definition (HD) and ultra-high-definition (UHD) devices (Liu et al. 2020a), VSR has attracted increasing attention from researchers and has become one of the research spotlights. Traditional super-resolution (SR) methods mainly include interpolation, statistical methods and sparse-representation methods.
In recent years, with the rapid development of deep neural networks, deep-learning-based VSR has attracted growing attention among researchers. Owing to their powerful data-fitting and feature-extraction abilities, such algorithms are generally superior to traditional super-resolution techniques.

*Corresponding Author: fhshang@xidian.edu.cn. Copyright © 2021, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

The first deep-learning-based single-image super-resolution (SISR) algorithm is SRCNN (Dong et al. 2015), while the first deep-learning-based VSR algorithm is Deep-DE (Liao et al. 2015). Since then, many deep-learning-based VSR algorithms have been proposed, such as VSRnet (Kappeler et al. 2016), 3DSRNet (Kim et al. 2018), RBPN (Haris, Shakhnarovich, and Ukita 2019) and TDAN (Tian et al. 2020). One might consider achieving VSR by applying an SISR algorithm frame by frame. However, SISR algorithms ignore the temporal consistency between frames and easily introduce artifacts and judder, leading to a worse visual experience. In contrast, VSR methods are usually able to process consecutive frames jointly and generate HR video with more natural details and fewer artifacts. Many VSR methods are based on motion estimation and motion compensation (MEMC). They rely heavily on optical-flow estimation between consecutive frames, and then perform compensation to import temporal information into the center frame. For example, SOF-VSR (Wang et al. 2018a) proposed a coarse-to-fine CNN that gradually estimates HR optical flow in three stages, and RBPN (Haris, Shakhnarovich, and Ukita 2019) presented a back-projection module composed of a sequence of alternating encoders and decoders applied after implicit alignment. These methods work well for videos consisting of scenes with small motion over a short time, as optical-flow estimation is accurate in such ideal scenes.
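The compensation step described above can be illustrated with a minimal sketch: given a per-pixel flow field, a neighboring frame is backward-warped so that its content lines up with the center frame. This is not the paper's method (DSMC performs implicit MEMC, and real MEMC modules use learned flow and bilinear sampling); the function name and nearest-neighbor sampling here are illustrative assumptions.

```python
import numpy as np

def warp_to_center(neighbor, flow):
    """Backward-warp a neighboring frame toward the center frame.

    neighbor: (H, W) grayscale frame.
    flow: (H, W, 2) per-pixel displacement (dy, dx) from the center
          frame into the neighbor, i.e. center[y, x] ~ neighbor[y+dy, x+dx].
    Nearest-neighbor sampling is used for simplicity; out-of-range
    coordinates are clamped to the image border.
    """
    H, W = neighbor.shape
    ys, xs = np.mgrid[0:H, 0:W]
    src_y = np.clip(np.rint(ys + flow[..., 0]).astype(int), 0, H - 1)
    src_x = np.clip(np.rint(xs + flow[..., 1]).astype(int), 0, W - 1)
    return neighbor[src_y, src_x]
```

With an accurate flow, the warped neighbor matches the center frame except at borders, which is what lets MEMC-based methods fuse temporal information pixel-aligned; when the flow is wrong (as under large motion), the warp imports misaligned content instead.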
However, in real multimedia scenes, motion varies widely in amplitude. In particular, now that real-time shooting of scenes such as extreme sports has become popular, wearable shooting equipment is widely used and often introduces video jitter, which easily produces large motion. Optical-flow computation in visual tasks rests on a consistency assumption (Gibson 1957); when the target motion changes too fast relative to the frame rate, this assumption breaks down, and we call such motion large motion. Moreover, some VSR methods do not perform explicit MEMC. They directly take multiple frames as input for spatio-temporal feature extraction, fusion and super-resolution, thereby achieving implicit MEMC. For example, 3DSRNet (Kim et al. 2018) and FSTRN (Li et al. 2019) utilize 3D convolution (C3D) (Ji et al. 2012) to extract spatio-temporal correlations. However, the high computational complexity of C3D prevents them from developing deeper structures. This probably results in limited modeling and generalization ability, and in difficulty adapting to videos with large motion. To address the above challenges, we propose a novel video super-resolution network with Dual Subnet and Multi-stage Communicated Upsampling (DSMC) to maximize the communication of various decisive information for videos with large motion. DSMC receives a center LR frame and its neighboring frames for each SR step. After coarse-to-fine spatial feature extraction on the input frames, a U-shaped residual dense network with 3D convolution (U3D-RDN) is designed for DSMC. It encodes the input features and achieves both fine implicit MEMC and coarse spatial feature extraction in the encoding space, while also reducing the computational complexity. U3D-RDN then decodes the features through a subpixel-convolution upsampling layer.
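The subpixel-convolution upsampling mentioned above (introduced by ESPCN, Shi et al. 2016) ends with a depth-to-space rearrangement: a convolution first produces r^2 channel groups, which are then interleaved into an r-times larger spatial grid. A minimal numpy sketch of that rearrangement step, with shapes chosen for illustration:

```python
import numpy as np

def pixel_shuffle(x, r):
    """Rearrange a (C*r^2, H, W) feature map into (C, H*r, W*r).

    out[c, h*r + i, w*r + j] = x[c*r^2 + i*r + j, h, w], i.e. each
    group of r^2 channels fills one r-by-r block of the output grid.
    """
    Cr2, H, W = x.shape
    C = Cr2 // (r * r)
    x = x.reshape(C, r, r, H, W)          # split channels into (C, i, j)
    x = x.transpose(0, 3, 1, 4, 2)        # -> (C, H, i, W, j)
    return x.reshape(C, H * r, W * r)     # interleave into the large grid
```

A x4 upsampling can then be decomposed into two such x2 stages, which is the kind of sub-task decomposition the MSCU module exploits: each stage yields an intermediate result that can correct the features feeding the next stage.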
After another fine spatial feature extraction, a Multi-Stage Communicated Upsampling (MSCU) module is proposed to decompose the upsampling into multiple sub-tasks. It conducts feature correction with the help of the VSR result of each sub-task, and thus makes full use of the intermediate results of upsampling to guide the VSR. Finally, a dual subnet is presented to simulate the degradation of natural images, and the dual loss between the degraded VSR result and the original LR frame is computed to aid the training of DSMC. The main contributions of this paper are as follows:
• We propose a DSMC network for super-resolution of videos with large motion, designed to maximize the communication of various decisive information during the VSR process and to implicitly capture motion information. Through the proposed MSCU module, DSMC can guide the upsampling process with richer prior knowledge than other state-of-the-art methods. Meanwhile, the proposed U3D-RDN module can learn coarse-to-fine spatio-temporal features from the input video frames, and therefore effectively guide the VSR process under large motion.
• We propose a dual subnet for DSMC, which can simulate natural image degradation to reduce the solution space, enhance the generalization ability and help train DSMC better.
• Extensive experiments have been carried out to evaluate the proposed DSMC. We compare it with several state-of-the-art methods, including optical-flow-based and C3D-based ones. Experimental results confirm that DSMC is effective for videos with large motion as well as for generic videos without large motion.
• An ablation study of each individual design has been conducted to investigate the effectiveness of DSMC. We find that MSCU has the greatest influence on performance, as it recovers more details through multi-stage communication. U3D-RDN is also effective for extracting motion information.
The ablation study also indicates that the loss functions of the dual subnet influence the training of DSMC under different combinations of Charbonnier (Cb) and perceptual losses in the original loss function.

Related Work

SISR Methods Based on Deep Learning

Recently, with the development of deep learning, super-resolution algorithms based on deep learning usually perform much better than traditional methods in terms of various evaluation metrics, such as PSNR and SSIM. The first deep-learning-based SISR algorithm, SRCNN, was proposed by Dong et al. (2015). It consists of three convolutional layers and learns a non-linear mapping from LR images to HR images in an end-to-end manner. Since then, many deep-learning techniques have been transferred to SISR, helping subsequent methods achieve greater performance. Inspired by VGG (Simonyan and Zisserman 2014), some methods adopt deeper network architectures, such as VDSR (Kim, Kwon Lee, and Mu Lee 2016), EDSR (Lim et al. 2017) and RCAN (Zhang et al. 2018a). However, these methods may suffer from the vanishing-gradient problem. Therefore, many algorithms such as RDN (Zhang et al. 2018b) introduce skip connections between different layers, inspired by the residual network (ResNet) (He et al. 2016b). In addition, the input size of SRCNN is the same as that of the ground truth, which leads to high computational complexity. Therefore, most subsequent algorithms take a single LR image as input and perform upsampling on it at the end of the network, such as ESPCN (Shi et al. 2016) and DRN (Guo et al. 2020). Besides, other strategies such as the attention mechanism (Mnih et al. 2014), non-local operations (Wang et al. 2018b) and dense connections (Huang et al. 2017) have also been introduced to enhance the performance of SISR methods.

VSR Methods Based on Deep Learning

The earliest application of deep learning to VSR can be traced back to Deep-DE, proposed by Liao et al. (2015).
Since then, more advanced VSR methods have been proposed, such as VSRnet (Kappeler et al. 2016), VESPCN (Caballero et al. 2017), SOF-VSR (Wang et al. 2018a), RBPN (Haris, Shakhnarovich, and Ukita 2019), and 3DSRNet (Kim et al. 2018). Among the VSR methods using 2D convolution, explicit MEMC is widely used and studied. VSRnet used the Druleas algorithm (Drulea and Nedevschi 2011) to calculate optical flows.
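The dual-loss idea introduced earlier, degrading the VSR output back to LR and comparing it with the input, can be sketched in a few lines. This is only a toy illustration: the fixed average-pooling degradation and the exact loss form are assumptions here, whereas the paper's dual subnet learns the degradation with a network.

```python
import numpy as np

def avg_downsample(img, s):
    """Toy degradation: s-times average pooling over non-overlapping
    s-by-s blocks (stand-in for a learned degradation subnet)."""
    H, W = img.shape
    return img.reshape(H // s, s, W // s, s).mean(axis=(1, 3))

def charbonnier(a, b, eps=1e-3):
    """Charbonnier (Cb) loss, a smooth differentiable variant of L1."""
    return np.mean(np.sqrt((a - b) ** 2 + eps ** 2))

def dual_loss(sr, lr, s=4):
    """Penalty on the cycle SR -> degrade -> LR. Adding this term to
    the primal SR loss constrains the mapping: among the many HR images
    consistent with the training target, it favors those that also
    reproduce the observed LR frame, shrinking the solution space."""
    return charbonnier(avg_downsample(sr, s), lr)
```

During training, the total objective would combine the primal loss against the HR ground truth with this dual term on the LR cycle, in the spirit of dual regression for SISR (Guo et al. 2020).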
References

[1] Kilian Q. Weinberger et al. Densely Connected Convolutional Networks. CVPR 2017.
[2] Renjie Liao et al. Video Super-Resolution via Deep Draft-Ensemble Learning. ICCV 2015.
[3] Li Fei-Fei et al. Perceptual Losses for Real-Time Style Transfer and Super-Resolution. ECCV 2016.
[4] Yang Song et al. Class-Balanced Loss Based on Effective Number of Samples. CVPR 2019.
[5] Mingkui Tan et al. Closed-Loop Matters: Dual Regression Networks for Single Image Super-Resolution. CVPR 2020.
[6] Chenliang Xu et al. TDAN: Temporally-Deformable Alignment Network for Video Super-Resolution. CVPR 2020.
[7] Tie-Yan Liu et al. Dual Learning for Machine Translation. NIPS 2016.
[8] Radu Timofte et al. NTIRE 2019 Challenge on Video Deblurring and Super-Resolution: Dataset and Study. CVPR Workshops (CVPRW) 2019.
[9] Alex Graves et al. Recurrent Models of Visual Attention. NIPS 2014.
[10] Matthew A. Brown et al. Frame-Recurrent Video Super-Resolution. CVPR 2018.
[11] Nannan Wang et al. Video Face Super-Resolution with Motion-Adaptive Feedback Cell. AAAI 2020.
[12] Sergiu Nedevschi et al. Total Variation Regularization of Local-Global Optical Flow. IEEE ITSC 2011.
[13] Christian Ledig et al. Real-Time Video Super-Resolution with Spatio-Temporal Networks and Motion Compensation. CVPR 2017.
[14] Munchurl Kim et al. 3DSRnet: Video Super-Resolution Using 3D Convolutional Neural Networks. arXiv, 2018.
[15] Michael J. Black et al. Optical Flow Estimation Using a Spatial Pyramid Network. CVPR 2017.
[16] Ming Yang et al. 3D Convolutional Neural Networks for Human Action Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2010.
[17] Gregory Shakhnarovich et al. Deep Back-Projection Networks for Super-Resolution. CVPR 2018.
[18] Yun Fu et al. Image Super-Resolution Using Very Deep Residual Channel Attention Networks. ECCV 2018.
[19] Wei An et al. Learning for Video Super-Resolution through HR Optical Flow Estimation. ACCV 2018.
[20] Jian Sun et al. Deep Residual Learning for Image Recognition. CVPR 2016.
[21] Wangmeng Zuo et al. Blind Super-Resolution with Iterative Kernel Correction. CVPR 2019.
[22] Bo Du et al. Fast Spatio-Temporal Residual Network for Video Super-Resolution. CVPR 2019.
[23] Sinisa Todorovic et al. Temporal Deformable Residual Networks for Action Segmentation in Videos. CVPR 2018.
[24] Kyoung Mu Lee et al. Enhanced Deep Residual Networks for Single Image Super-Resolution. CVPR Workshops (CVPRW) 2017.
[25] Gregory Shakhnarovich et al. Recurrent Back-Projection Network for Video Super-Resolution. CVPR 2019.
[26] Thomas Brox et al. FlowNet 2.0: Evolution of Optical Flow Estimation with Deep Networks. CVPR 2017.
[27] Kyoung Mu Lee et al. Accurate Image Super-Resolution Using Very Deep Convolutional Networks. CVPR 2016.
[28] Fanhua Shang et al. Video Super-Resolution Based on Deep Learning: A Comprehensive Survey. Artificial Intelligence Review, 2020.
[29] Yun Fu et al. Residual Dense Network for Image Super-Resolution. CVPR 2018.
[30] Jiajun Wu et al. Video Enhancement with Task-Oriented Flow. International Journal of Computer Vision, 2018.
[31] Fanhua Shang et al. A Single Frame and Multi-Frame Joint Network for 360-degree Panorama Video Super-Resolution. arXiv, 2020.
[32] Andrew Zisserman et al. Very Deep Convolutional Networks for Large-Scale Image Recognition. ICLR 2014.
[33] Luc Van Gool et al. Dynamic Filter Networks. NIPS 2016.
[34] Weidong Sheng et al. Deformable 3D Convolution for Video Super-Resolution. IEEE Signal Processing Letters, 2020.
[35] Seoung Wug Oh et al. Deep Video Super-Resolution Network Using Dynamic Upsampling Filters Without Explicit Motion Compensation. CVPR 2018.
[36] Aggelos K. Katsaggelos et al. Video Super-Resolution with Convolutional Neural Networks. IEEE Transactions on Computational Imaging, 2016.
[37] Andrew Zisserman et al. Spatial Transformer Networks. NIPS 2015.
[38] J. Gibson. Optical Motions and Transformations as Stimuli for Visual Perception. Psychological Review, 1957.
[39] Xiaoou Tang et al. Image Super-Resolution Using Deep Convolutional Networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2014.