Visual speech synthesis from 3D mesh sequences driven by combined speech features

Given a pre-registered 3D mesh sequence with accompanying phoneme-labeled audio, our system builds an animatable face model and a mapping procedure that produces realistic speech animations for arbitrary speech input. Speech features are mapped to model parameters using random forest regression. We propose a new speech feature that combines phonemic labels with acoustic features; it yields more expressive facial animation and robustly handles temporal labeling errors. Furthermore, a sliding-window approach to feature extraction makes the system easy to train and allows low-delay synthesis. We show that our combination of speech features improves visual speech synthesis, and a subjective user study confirms our findings.
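The pipeline the abstract describes can be sketched as follows. This is a minimal illustration, not the authors' implementation: the feature dimensions, window radius, and forest size are placeholder assumptions, and random data stands in for real per-frame speech features (e.g. phoneme labels plus acoustic coefficients) and face model parameters.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def sliding_windows(features, radius):
    """Stack each frame with its +/- radius neighbors (edge-padded),
    so every sample carries temporal context for low-delay synthesis."""
    padded = np.pad(features, ((radius, radius), (0, 0)), mode="edge")
    return np.stack(
        [padded[i:i + 2 * radius + 1].ravel() for i in range(len(features))]
    )

# Toy stand-in data; dimensions are illustrative assumptions.
rng = np.random.default_rng(0)
speech = rng.normal(size=(200, 13))      # 200 frames, 13-dim combined features
face_params = rng.normal(size=(200, 5))  # 5 face-model parameters per frame

X = sliding_windows(speech, radius=3)    # 7-frame context window per sample
model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X, face_params)                # multi-output regression to the face model
pred = model.predict(X)
print(X.shape, pred.shape)
```

Because the window only needs a few frames of future context, prediction can run with a fixed small latency, which is what makes low-delay synthesis possible in this setup.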
