Acoustic speech to lip feature mapping for multimedia applications

This paper presents a quantitative analysis of the relationship between acoustic speech and the corresponding lip features. The lip features, namely lip width, inner lip height, and outer lip height, are acquired by an extraction algorithm that combines color and edge information within a Markov random field (MRF) framework. The acoustic speech is parameterized by line spectrum pair (LSP) coefficients. The LSP coefficients and the lip features are then used to train mapping models, which estimate lip features from acoustic speech. The estimated lip features match the measured lip features well, with correlation coefficients as high as 0.90. Estimating lip features from acoustic speech provides a way to integrate acoustic and visual speech, which is useful for speech-driven face animation, audio-video synchronization, and foreign-film dubbing.
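As a minimal sketch of the acoustic front end, the function below derives LSP frequencies from LPC coefficients by splitting the LPC polynomial into its symmetric and antisymmetric parts and taking their unit-circle root angles. The paper does not give its exact LSP procedure; the use of librosa.lpc, the analysis order of 10, and the frame handling are assumptions made for illustration.

```python
import numpy as np
import librosa  # assumed here only for LPC analysis; any LPC routine would do

def lsp_from_frame(frame, order=10):
    """Line spectrum pair (LSP) frequencies for one speech frame.

    The LPC polynomial A(z) is split into a symmetric polynomial P(z) and
    an antisymmetric polynomial Q(z); their root angles on the unit circle,
    sorted, are the LSP frequencies. This sketch assumes an even order.
    """
    a = librosa.lpc(np.asarray(frame, dtype=float), order=order)  # a[0] == 1
    a_pad = np.append(a, 0.0)
    p = a_pad + a_pad[::-1]            # symmetric P(z)
    q = a_pad - a_pad[::-1]            # antisymmetric Q(z)
    p, _ = np.polydiv(p, [1.0, 1.0])   # deflate the fixed root at z = -1
    q, _ = np.polydiv(q, [1.0, -1.0])  # deflate the fixed root at z = +1
    angles = np.angle(np.concatenate([np.roots(p), np.roots(q)]))
    return np.sort(angles[angles > 0])  # one angle per conjugate root pair

# Toy usage: a synthetic 20 ms frame at 8 kHz yields 10 LSP frequencies
# (in radians) clustered around the spectral peak near 2*pi*500/8000.
t = np.arange(160) / 8000.0
frame = np.sin(2 * np.pi * 500 * t) + 0.01 * np.random.randn(160)
print(lsp_from_frame(frame))
```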

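The paper does not state which model family realizes the acoustic-to-visual mapping here, so the sketch below stands in a small multilayer-perceptron regressor and reproduces the correlation-coefficient evaluation; the random arrays are placeholders for real per-frame LSP vectors and measured lip features.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
# Placeholder data: 1000 frames of 10 LSP coefficients (X) and three
# measured lip features per frame (Y: width, inner height, outer height).
X = rng.uniform(0.0, np.pi, size=(1000, 10))
Y = rng.normal(size=(1000, 3))

split = 800  # simple train/test split
model = MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000, random_state=0)
model.fit(X[:split], Y[:split])
Y_hat = model.predict(X[split:])

# Pearson correlation between measured and estimated lip features,
# the paper's reported figure of merit (up to 0.90 on its corpus).
for k, name in enumerate(["lip width", "inner lip height", "outer lip height"]):
    r = np.corrcoef(Y[split:, k], Y_hat[:, k])[0, 1]
    print(f"{name}: r = {r:.2f}")
```

On real data, a context window of neighboring LSP frames is a common input choice, since lip motion reflects coarticulation rather than a single frame.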