Speech Driven Talking Head Generation via Attentional Landmarks Based Representation

Previous talking head generation methods mostly focus on frontal face synthesis while neglecting natural person head motion. In this paper, a generative adversarial network (GAN) based method is proposed to generate talking head video with not only high quality facial appearance, accurate lip movement, but also natural head motion. To this aim, the facial landmarks are detected and used to represent lip motion and head pose, and the conversions from speech to these middle level representations are learned separately through Convolutional Neural Networks (CNN) with wingloss. The Gated Recurrent Unit (GRU) is adopted to regularize the sequential transition. The representations for different factors of talking head are jointly feeded to a Generative Adversarial Network (GAN) based model with an attentional mechanism to synthesize the talking video.Extensive experiments on the benchmark dataset as well as our own collected dataset validate that the propose method can yield talking videos with natural head motions, and the performance is superior to state-of-the-art talking face generation methods.