Escaping the complexity-bitrate-quality barriers of video encoders via deep perceptual optimization

We extend the concept of learnable video precoding (rate-aware neural-network processing prior to encoding) to deep perceptual optimization (DPO). Our framework comprises a pixel-to-pixel convolutional neural network that is trained based on the virtualization of core encoding blocks (block transform, quantization, block-based prediction) and multiple loss functions representing rate, distortion and visual quality of the virtual encoder. We evaluate our proposal with AVC/H.264 and AV1 under per-clip rate-quality optimization. The results show that DPO offers, on average, 14.2% bitrate reduction over AVC/H.264 and 12.5% bitrate reduction over AV1. Our framework is shown to improve both distortion- and perception-oriented metrics in a consistent manner, exhibiting only 3% outliers, which correspond to content with peculiar characteristics. Thus, DPO is shown to offer complexity-bitrate-quality tradeoffs that go beyond what conventional video encoders can offer.

[1]  Zhou Wang,et al.  Multiscale structural similarity for image quality assessment , 2003, The Thrity-Seventh Asilomar Conference on Signals, Systems & Computers, 2003.

[2]  Debargha Mukherjee,et al.  Machine Learning Accelerated Transform Search For AV1 , 2019, 2019 Picture Coding Symposium (PCS).

[3]  Eirikur Agustsson,et al.  Scale-Space Flow for End-to-End Optimized Video Compression , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[4]  Heiko Schwarz,et al.  Context-based adaptive binary arithmetic coding in the H.264/AVC video compression standard , 2003, IEEE Trans. Circuits Syst. Video Technol..

[5]  Mariana Afonso,et al.  Video Compression Based on Spatio-Temporal Resolution Adaptation , 2019, IEEE Transactions on Circuits and Systems for Video Technology.

[6]  Anne Aaron,et al.  A large-scale video codec comparison of x264, x265 and libvpx for practical VOD applications , 2016, Optical Engineering + Applications.

[7]  Balu Adsumilli,et al.  YouTube UGC Dataset for Video Compression Research , 2019, 2019 IEEE 21st International Workshop on Multimedia Signal Processing (MMSP).

[8]  Steve Branson,et al.  Learned Video Compression , 2018, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[9]  Mohammed Ghanbari,et al.  Scope of validity of PSNR in image/video quality assessment , 2008 .

[10]  David Minnen,et al.  Variational image compression with a scale hyperprior , 2018, ICLR.

[11]  Lucas Theis,et al.  Lossy Image Compression with Compressive Autoencoders , 2017, ICLR.

[12]  Henrique S. Malvar,et al.  Low-complexity transform and quantization in H.264/AVC , 2003, IEEE Trans. Circuits Syst. Video Technol..

[13]  Zhou Wang,et al.  Blind Image Quality Assessment Using a Deep Bilinear Convolutional Neural Network , 2019, IEEE Transactions on Circuits and Systems for Video Technology.

[14]  Yochai Blau,et al.  The Perception-Distortion Tradeoff , 2017, CVPR.

[15]  Jungwon Lee,et al.  Variable Rate Deep Image Compression With a Conditional Autoencoder , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[16]  Sanghoon Lee,et al.  Deep Learning of Human Visual Sensitivity in Image Quality Assessment Framework , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[17]  Yue Chen,et al.  An Overview of Core Coding Tools in the AV1 Video Codec , 2018, 2018 Picture Coding Symposium (PCS).

[18]  Ching-Han Chiang,et al.  A Multi-Pass Coding Mode Search Framework For AV1 Encoder Optimization , 2019, 2019 Data Compression Conference (DCC).

[19]  Lubomir D. Bourdev,et al.  Real-Time Adaptive Image Compression , 2017, ICML.

[20]  Abdelaziz Djelouah,et al.  Neural Inter-Frame Compression for Video Coding , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[21]  Ioannis Katsavounidis,et al.  Video codec comparison using the dynamic optimizer framework , 2018, Optical Engineering + Applications.

[22]  Vladlen Koltun,et al.  Multi-Scale Context Aggregation by Dilated Convolutions , 2015, ICLR.

[23]  Peyman Milanfar,et al.  Learned perceptual image enhancement , 2017, 2018 IEEE International Conference on Computational Photography (ICCP).

[24]  Vasileios Giotsas,et al.  Deep Video Precoding , 2019, IEEE Transactions on Circuits and Systems for Video Technology.

[25]  Valero Laparra,et al.  End-to-end Optimized Image Compression , 2016, ICLR.

[26]  Peyman Milanfar,et al.  NIMA: Neural Image Assessment , 2017, IEEE Transactions on Image Processing.

[27]  Xiaoyun Zhang,et al.  DVC: An End-To-End Deep Video Compression Framework , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[28]  Dmitriy Vatolin,et al.  Hacking VMAF with Video Color and Contrast Distortion , 2019, GraphiCon'2019 Proceedings. Volume 2.

[29]  Li Fei-Fei,et al.  ImageNet: A large-scale hierarchical image database , 2009, CVPR.