Learned Source and Channel Coding for Talking-Head Semantic Transmission

How can video from a specific domain be transmitted efficiently over wireless channels? Established systems combine H.26x video coding with 5G LDPC channel coding, but their end-to-end transmission efficiency remains far from the limit for video sources in a specific domain. In this paper, we design a semantic communication system tailored to transmitting video calling streams over wireless channels. Inspired by recent progress in talking-head animation, we propose a talking-head semantic transmission (THST) system, which efficiently transmits a motion keypoint representation as compact semantic information to drive free-view talking-head synthesis at the receiver. Since the motion semantic keypoints are correlated, our THST system learns a nonlinear analysis transform that maps the keypoints across multiple frames into a latent space, then transmits the latent hyper semantic representation to the receiver via deep joint source-channel coding. Our system incorporates a latent prior to estimate the varying importance of the semantic keypoints; accordingly, we realize variable-rate joint source-channel coding to obtain a system-level coding gain. Extensive experiments show that our THST system outperforms conventionally engineered competing systems on benchmark datasets. Moreover, owing to the system-level joint source and channel design, our method provides much more robust performance over noisy channels at only 33% bandwidth cost relative to current talking-head compression combined with 5G LDPC coded transmission.
