Multi-modality Deep Restoration of Extremely Compressed Face Videos

Arguably the most common and salient object in daily video communications is the talking head, as encountered in social media, virtual classrooms, teleconferences, news broadcasting, talk shows, etc. When communication bandwidth is limited by network congestion or cost considerations, compression artifacts in talking head videos are inevitable. The resulting degradation in video quality is highly visible and objectionable because of the high acuity of the human visual system to faces. To solve this problem, we develop a multi-modality deep convolutional neural network (DCNN) method for restoring face videos that are aggressively compressed. The main innovation is a new DCNN architecture that incorporates known priors of multiple modalities: the video-synchronized speech signal and semantic elements of the compression code stream, including motion vectors, the code partition map, and quantization parameters. These priors correlate strongly with the latent video and hence enhance the capability of deep learning to remove compression artifacts. Ample empirical evidence is presented to validate the superior performance of the proposed DCNN method on face videos over existing state-of-the-art methods.
