Personalized Speech2Video with 3D Skeleton Regularization and Expressive Body Poses

In this paper, we propose a novel approach to convert given speech audio to a photo-realistic speaking video of a specific person, where the output video has synchronized, realistic, and expressive rich body dynamics. We achieve this by first generating 3D skeleton movements from the audio sequence using a recurrent neural network (RNN), and then synthesizing the output video via a conditional generative adversarial network (GAN). To make the skeleton movement realistic and expressive, we embed the knowledge of an articulated 3D human skeleton and a learned dictionary of personal speech iconic gestures into the generation process in both learning and testing pipelines. The former prevents the generation of unreasonable body distortion, while the later helps our model quickly learn meaningful body movement through a few recorded videos. To produce photo-realistic and high-resolution video with motion details, we propose to insert part attention mechanisms in the conditional GAN, where each detailed part, e.g. head and hand, is automatically zoomed in to have their own discriminators. To validate our approach, we collect a dataset with 20 high-quality videos from 1 male and 1 female model reading various documents under different topics. Compared with previous SoTA pipelines handling similar tasks, our approach achieves better results by a user study.

[1]  Chi-Keung Tang,et al.  Deep Video Generation, Prediction and Completion of Human Action Sequences , 2017, ECCV.

[2]  Jitendra Malik,et al.  Learning 3D Human Dynamics From Video , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[3]  Victor Lempitsky,et al.  Textured Neural Avatars , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[4]  Ira Kemelmacher-Shlizerman,et al.  Audio to Body Dynamics , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[5]  Dimitrios Tzionas,et al.  Expressive Body Capture: 3D Hands, Face, and Body From a Single Image , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[6]  Jitendra Malik,et al.  Learning Individual Styles of Conversational Gesture , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[7]  Alexei A. Efros,et al.  Image-to-Image Translation with Conditional Adversarial Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[8]  Beth Logan,et al.  Mel Frequency Cepstral Coefficients for Music Modeling , 2000, ISMIR.

[9]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[10]  Dario Pavllo,et al.  3D Human Pose Estimation in Video With Temporal Convolutions and Semi-Supervised Training , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[11]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[12]  Alexei A. Efros,et al.  Everybody Dance Now , 2018, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[13]  Wenhan Luo,et al.  Liquid Warping GAN: A Unified Framework for Human Motion Imitation, Appearance Transfer and Novel View Synthesis , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[14]  Yisong Yue,et al.  A deep learning approach for generalized speech animation , 2017, ACM Trans. Graph..

[15]  Adam Finkelstein,et al.  Text-based editing of talking-head video , 2019, ACM Trans. Graph..

[16]  Michael J. Black,et al.  On Human Motion Prediction Using Recurrent Neural Networks , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[17]  Jaakko Lehtinen,et al.  Audio-driven facial animation by joint end-to-end learning of pose and emotion , 2017, ACM Trans. Graph..

[18]  Jianfei Cai,et al.  3D Hand Shape and Pose Estimation from a Single RGB Image (Supplementary Material) , 2019 .

[19]  Varun Ganapathi,et al.  LumièreNet: Lecture Video Synthesis from Audio , 2019, ArXiv.

[20]  Justus Thies,et al.  Deferred Neural Rendering: Image Synthesis using Neural Textures , 2019 .

[21]  Yaser Sheikh,et al.  OpenPose: Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[22]  Bob Katz,et al.  Mastering Audio: The Art and the Science , 2002 .

[23]  Gaurav Mittal,et al.  Animating Face using Disentangled Audio Representations , 2019, 2020 IEEE Winter Conference on Applications of Computer Vision (WACV).

[24]  Justus Thies,et al.  Neural Voice Puppetry: Audio-driven Facial Reenactment , 2019, ECCV.

[25]  Ira Kemelmacher-Shlizerman,et al.  Synthesizing Obama , 2017, ACM Trans. Graph..

[26]  Dahua Lin,et al.  Convolutional Sequence Generation for Skeleton-Based Action Synthesis , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[27]  Jürgen Schmidhuber,et al.  Framewise phoneme classification with bidirectional LSTM and other neural network architectures , 2005, Neural Networks.

[28]  M. A. Anusuya,et al.  Speech Recognition by Machine, A Review , 2010, ArXiv.

[29]  Justus Thies,et al.  Face2Face: real-time face capture and reenactment of RGB videos , 2019, Commun. ACM.

[30]  Jan Kautz,et al.  Video-to-Video Synthesis , 2018, NeurIPS.

[31]  Jan Kautz,et al.  High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[32]  Yi Zhou,et al.  Auto-Conditioned Recurrent Networks for Extended Complex Human Motion Synthesis , 2017, ICLR.