Lip movement generation using restricted Boltzmann machines for visual speech synthesis

This paper proposes methods that use restricted Boltzmann machines (RBMs) to generate lip image sequences for visual speech synthesis. These methods aim to alleviate the over-smoothing effect of the conventional hidden Markov model (HMM) based statistical approach to lip synthesis. Two model structures for modeling and generating lip movements with RBMs are investigated. First, RBMs are adopted to replace Gaussian distributions as the density functions of HMM states. Second, a deep belief network (DBN) is constructed by stacking multiple RBMs to model the joint distribution between the lip image of each frame and its corresponding context features. Experimental results show that our proposed methods significantly improve the quality of the generated lip images. The DBN model structure combined with raw pixel features achieves the best performance in our experiments.
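To make these two structures concrete, the sketch below shows the shared building block: a Gaussian-Bernoulli RBM trained with one-step contrastive divergence (CD-1), greedily stacked into a DBN in the manner of Hinton et al. (2006). This is an illustration of the general technique under stated assumptions, not the authors' implementation; the identifiers (GaussianBernoulliRBM, train_dbn) and the toy dimensions are hypothetical. In the first structure, one such RBM per HMM state would replace that state's Gaussian output density; in the second, the visible layer would hold a frame's lip image pixels jointly with its context features.

# Illustrative sketch (assumptions noted above, not the paper's code):
# a Gaussian-Bernoulli RBM with unit-variance real-valued visible units
# (e.g. lip image features) and binary hidden units, trained with CD-1.
# Energy: E(v,h) = ||v - b_v||^2 / 2 - b_h . h - v . W . h
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GaussianBernoulliRBM:
    def __init__(self, n_visible, n_hidden, lr=1e-3):
        self.W = 0.01 * rng.standard_normal((n_visible, n_hidden))
        self.b_v = np.zeros(n_visible)  # visible (Gaussian) biases
        self.b_h = np.zeros(n_hidden)   # hidden (Bernoulli) biases
        self.lr = lr

    def hidden_prob(self, v):
        # P(h = 1 | v) for unit-variance visible units
        return sigmoid(v @ self.W + self.b_h)

    def visible_mean(self, h):
        # E[v | h] for Gaussian visible units with unit variance
        return h @ self.W.T + self.b_v

    def cd1_update(self, v0):
        # Positive phase: hidden probabilities and a binary sample
        p_h0 = self.hidden_prob(v0)
        h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
        # Negative phase: one Gibbs step (mean-field reconstruction)
        v1 = self.visible_mean(h0)
        p_h1 = self.hidden_prob(v1)
        # CD-1 gradient approximations, averaged over the batch
        batch = v0.shape[0]
        self.W += self.lr * (v0.T @ p_h0 - v1.T @ p_h1) / batch
        self.b_v += self.lr * (v0 - v1).mean(axis=0)
        self.b_h += self.lr * (p_h0 - p_h1).mean(axis=0)

# Greedy layer-wise stacking into a DBN: train one RBM, then feed its
# hidden probabilities to the next layer as data. (A full implementation
# would use Bernoulli-Bernoulli RBMs above the first layer.)
def train_dbn(data, layer_sizes, epochs=10):
    rbms, x = [], data
    for n_hidden in layer_sizes:
        rbm = GaussianBernoulliRBM(x.shape[1], n_hidden)
        for _ in range(epochs):
            rbm.cd1_update(x)
        rbms.append(rbm)
        x = rbm.hidden_prob(x)  # propagate up for the next layer
    return rbms

# Toy usage: 256 frames of 64-dim "lip features", two hidden layers
features = rng.standard_normal((256, 64))
dbn = train_dbn(features, [128, 64])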
