Visual speech synthesis from 3D mesh sequences driven by combined speech features

Given a pre-registered 3D mesh sequence with accompanying phoneme-labeled audio, our system builds an animatable face model and a mapping procedure that produces realistic speech animations for arbitrary speech input. Speech features are mapped to model parameters using random forest regression. We propose a new speech feature that combines phonemic labels with acoustic features; it yields more expressive facial animation and robustly handles temporal labeling errors. Furthermore, a sliding-window approach to feature extraction makes the system easy to train and allows low-delay synthesis. We show that our combination of speech features improves visual speech synthesis, and a subjective user study confirms our findings.
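The pipeline the abstract describes can be sketched as follows. This is a minimal illustration, not the authors' implementation: the feature dimensions, window radius, and forest size are placeholder assumptions, and random data stands in for real per-frame speech features (e.g. phoneme labels plus acoustic coefficients) and face model parameters.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def sliding_windows(features, radius):
    """Stack each frame with its +/- radius neighbors (edge-padded),
    so every sample carries temporal context for low-delay synthesis."""
    padded = np.pad(features, ((radius, radius), (0, 0)), mode="edge")
    return np.stack(
        [padded[i:i + 2 * radius + 1].ravel() for i in range(len(features))]
    )

# Toy stand-in data; dimensions are illustrative assumptions.
rng = np.random.default_rng(0)
speech = rng.normal(size=(200, 13))      # 200 frames, 13-dim combined features
face_params = rng.normal(size=(200, 5))  # 5 face-model parameters per frame

X = sliding_windows(speech, radius=3)    # 7-frame context window per sample
model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X, face_params)                # multi-output regression to the face model
pred = model.predict(X)
print(X.shape, pred.shape)
```

Because the window only needs a few frames of future context, prediction can run with a fixed small latency, which is what makes low-delay synthesis possible in this setup.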
