Speech-driven face synthesis from 3D video

We present a framework for speech-driven synthesis of real faces from a corpus of 3D video of a person speaking. Video-rate capture of dynamic 3D face shape and colour appearance provides the basis for a visual speech synthesis model. A displacement map representation combines face shape and colour into a 3D video. This representation is used to efficiently register and integrate shape and colour information captured from multiple views. To allow visual speech synthesis viseme primitives are identified from the corpus using automatic speech recognition. A novel nonrigid alignment algorithm is introduced to estimate dense correspondence between 3D face shape and appearance for different visemes. The registered displacement map representation together with a novel optical flow optimisation using both shape and colour, enables accurate and efficient nonrigid alignment. Face synthesis from speech is performed by concatenation of the corresponding viseme sequence using the nonrigid correspondence to reproduce both 3D face shape and colour appearance. Concatenative synthesis reproduces both viseme timing and co-articulation. Face capture and synthesis has been performed for a database of 51 people. Results demonstrate synthesis of 3D visual speech animation with a quality comparable to the captured video of a person.

[1]  Pascal Fua,et al.  Regularized Bundle-Adjustment to Model Heads from Image Sequences without Calibration Data , 2000, International Journal of Computer Vision.

[2]  Paul J. Besl,et al.  A Method for Registration of 3-D Shapes , 1992, IEEE Trans. Pattern Anal. Mach. Intell..

[3]  David Salesin,et al.  Synthesizing realistic facial expressions from photographs , 1998, SIGGRAPH.

[4]  Matthew Brand,et al.  Voice puppetry , 1999, SIGGRAPH.

[5]  Luc Van Gool,et al.  Realistic face animation for speech , 2002, Comput. Animat. Virtual Worlds.

[6]  Adrian Hilton,et al.  Video-rate capture of dynamic face shape and appearance , 2004, Sixth IEEE International Conference on Automatic Face and Gesture Recognition, 2004. Proceedings..

[7]  Manuel Menezes de Oliveira Neto,et al.  A hole-filling strategy for reconstruction of smooth surfaces in range images , 2003, 16th Brazilian Symposium on Computer Graphics and Image Processing (SIBGRAPI 2003).

[8]  Zicheng Liu,et al.  Model-based bundle adjustment with application to face modeling , 2001, Proceedings Eighth IEEE International Conference on Computer Vision. ICCV 2001.

[9]  David J. Fleet,et al.  Performance of optical flow techniques , 1994, International Journal of Computer Vision.

[10]  Matthew Turk,et al.  A Morphable Model For The Synthesis Of 3D Faces , 1999, SIGGRAPH.

[11]  Christoph Bregler,et al.  Video Rewrite: Driving Visual Speech with Audio , 1997, SIGGRAPH.

[12]  Nadia Magnenat-Thalmann,et al.  Head Modeling from Pictures and Morphing in 3D with Image Metamorphosis Based on Triangulation , 1998, CAPTECH.

[13]  Milan Sonka,et al.  Image Processing, Analysis and Machine Vision , 1993, Springer US.

[14]  Dimitris N. Metaxas,et al.  Deformable model-based shape and motion analysis from images using motion residual error , 1998, Sixth International Conference on Computer Vision (IEEE Cat. No.98CH36271).

[15]  Hans Peter Graf,et al.  Photo-Realistic Talking-Heads from Image Samples , 2000, IEEE Trans. Multim..

[16]  Tony Ezzat,et al.  Trainable videorealistic speech animation , 2002, Sixth IEEE International Conference on Automatic Face and Gesture Recognition, 2004. Proceedings..