A differentiable VMAF proxy as a loss function for video noise reduction

Traditional metrics for evaluating video quality do not completely capture the nuances of the Human Visual System (HVS), but they are simple to use for quantitatively optimizing parameters in enhancement or restoration. Modern Full-Reference Perceptual Visual Quality Metrics (PVQMs), such as the Video Multi-Method Assessment Fusion (VMAF) function, are more robust than traditional metrics in terms of the HVS, but they are generally complex and non-differentiable. This lack of differentiability means that they cannot be readily used in optimization scenarios for enhancement or restoration. In this paper we look at the formulation of a perceptually motivated restoration framework for video. We deploy this process in the context of denoising by training a spatio-temporal denoiser deep convolutional neural network (DCNN). We design DCNNs as differentiable proxies for both a spatial and a temporal version of VMAF. These proxies are combined with traditional losses to form the proposed perceptually motivated loss function, which is used to update the weights of the spatio-temporal denoising DCNN. Our results show that using the perceptual loss function as a fine-tuning step yields a higher VMAF score and a lower PSNR than the same spatio-temporal network trained with the traditional mean squared error loss. Using the perceptual loss function for the entirety of training yields a lower VMAF and a lower PSNR, but the output has visibly less noise.
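The idea of mixing a proxy term with a traditional loss can be sketched as follows. This is only an illustration under stated assumptions, not the loss used in the paper: the weighted-sum form, the `weight` parameter, the names `perceptual_loss` and `toy_proxy`, and the scaling of the proxy score to [0, 100] are all hypothetical. The `toy_proxy` function is a stand-in for a trained proxy network and does not compute VMAF.

```python
from typing import Callable, Sequence

Frame = Sequence[float]  # one frame flattened to pixel values in [0, 1]
Proxy = Callable[[Frame, Frame], float]  # predicted quality score in [0, 100]


def mse(denoised: Frame, reference: Frame) -> float:
    """Mean squared error between two equally sized frames."""
    if len(denoised) != len(reference):
        raise ValueError("frames must have the same number of pixels")
    return sum((d - r) ** 2 for d, r in zip(denoised, reference)) / len(reference)


def toy_proxy(denoised: Frame, reference: Frame) -> float:
    """Toy stand-in for a trained proxy network (assumption, NOT VMAF).

    Maps the MSE to a score in (0, 100], where 100 means identical frames.
    """
    return 100.0 / (1.0 + 100.0 * mse(denoised, reference))


def perceptual_loss(
    denoised: Frame, reference: Frame, proxy: Proxy, weight: float = 0.5
) -> float:
    """Weighted sum of MSE and a proxy-based penalty (assumed form).

    A higher proxy score means better quality, so the score is turned into
    a penalty in [0, 1] before it is mixed with the MSE.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    proxy_penalty = 1.0 - proxy(denoised, reference) / 100.0
    return (1.0 - weight) * mse(denoised, reference) + weight * proxy_penalty


if __name__ == "__main__":
    reference = [0.5, 0.5]
    denoised = [0.6, 0.4]
    print(perceptual_loss(denoised, reference, toy_proxy, weight=0.5))
```

In an actual training loop the frames would be tensors in an automatic-differentiation framework and the proxy would be a network with frozen weights, so that gradients of the proxy term flow back to the denoiser. Setting `weight` to 0 recovers plain MSE training.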
