Realistic Mouth-Synching for Speech-Driven Talking Face Using Articulatory Modelling

This paper presents an articulatory modelling approach to converting acoustic speech into realistic mouth animation. We directly model the movements of articulators such as the lips, tongue, and teeth using a dynamic Bayesian network (DBN)-based audio-visual articulatory model (AVAM). The model adopts a multiple-stream structure with a shared articulator layer to synchronously associate the two building blocks of speech, i.e., audio and video. It not only describes the synchronization between visual articulatory movements and audio speech, but also reflects the linguistic fact that different articulators evolve asynchronously. We also present a Baum-Welch DBN inversion (DBNI) algorithm that generates optimal facial parameters from audio, given the trained AVAM, under the maximum likelihood (ML) criterion. Extensive objective and subjective evaluations on the JEWEL audio-visual dataset demonstrate that, compared with phonemic HMM approaches, the facial parameters estimated by our approach follow the true parameters more closely, and the synthesized facial animation sequences are so lively that 38% of them are indistinguishable from the real ones.
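The paper's DBNI algorithm infers facial parameters from audio under a trained joint audio-visual model. As a hedged illustration of that core idea only (this is not the authors' algorithm, and all names and numbers below are hypothetical), the following minimal sketch performs ML audio-to-visual conversion assuming the audio feature and a single visual parameter are jointly Gaussian, where the ML estimate is the conditional mean:

```python
# Minimal sketch, NOT the paper's DBNI algorithm: maximum-likelihood
# audio-to-visual mapping assuming a jointly Gaussian (audio, visual) pair.
# Under that assumption the ML (conditional-mean) estimate of the visual
# parameter given an audio observation is:
#     v_hat = mu_v + (cov_av / var_a) * (audio - mu_a)

def ml_visual_estimate(audio, mu_a, mu_v, var_a, cov_av):
    """ML estimate of a visual (facial) parameter given an audio feature,
    for a bivariate Gaussian with means mu_a, mu_v, audio variance var_a,
    and audio-visual covariance cov_av."""
    return mu_v + (cov_av / var_a) * (audio - mu_a)

# Toy "trained model" (hypothetical values): audio energy and lip opening
# are positively correlated, so louder frames predict a wider mouth.
mu_a, mu_v = 0.0, 0.5   # means of audio feature and lip-opening parameter
var_a = 1.0             # audio variance
cov_av = 0.8            # audio-visual covariance

print(ml_visual_estimate(1.0, mu_a, mu_v, var_a, cov_av))
```

In the paper's setting the joint model is a multi-stream DBN rather than a single Gaussian, and the Baum-Welch-style inversion iterates over hidden articulator states, but the per-state audio-to-visual regression follows the same conditional-mean principle.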
