Measuring the relation between speech acoustics and 2D facial motion

This paper presents a quantitative analysis of the relation between speech acoustics and the 2D video signal of the facial motion that occurs simultaneously. 2D facial motion is acquired with an ordinary video camera: after a video sequence is digitized, a search algorithm tracks markers painted on the speaker's face. Facial motion is represented by the 2D marker trajectories, while line spectrum pair (LSP) coefficients parameterize the speech acoustics. The LSP coefficients and marker trajectories are then used to train time-invariant and time-varying linear models, as well as nonlinear (neural network) models. These models are used to evaluate the extent to which 2D facial motion is determined by speech acoustics. Correlation coefficients between measured and estimated trajectories reach 0.95. The ability to estimate facial motion from speech acoustics in this way suggests a path toward integrating audio and visual signals for efficient audio-visual speech coding.
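To make the modeling step concrete, below is a minimal sketch of the time-invariant linear case: a single least-squares map from LSP frames to marker coordinates, scored by the per-coordinate correlation between measured and estimated trajectories. The data shapes, synthetic signals, and the `corr` helper are hypothetical stand-ins for illustration, not the paper's actual data or code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the paper's data:
# X holds LSP coefficients per analysis frame; Y holds the (x, y)
# coordinates of the tracked face markers at the same frames.
n_frames, n_lsp, n_markers = 2000, 10, 12
X = rng.standard_normal((n_frames, n_lsp))
true_map = rng.standard_normal((n_lsp, 2 * n_markers))  # synthetic ground truth
Y = X @ true_map + 0.3 * rng.standard_normal((n_frames, 2 * n_markers))

# Time-invariant linear model: one least-squares matrix W mapping
# LSP frames to marker coordinates, fit on a training split.
split = n_frames // 2
W, *_ = np.linalg.lstsq(X[:split], Y[:split], rcond=None)
Y_hat = X[split:] @ W

def corr(a, b):
    """Pearson correlation per column (one value per marker coordinate)."""
    a = a - a.mean(axis=0)
    b = b - b.mean(axis=0)
    return (a * b).sum(axis=0) / np.sqrt((a**2).sum(axis=0) * (b**2).sum(axis=0))

r = corr(Y[split:], Y_hat)
print(f"mean correlation over marker coordinates: {r.mean():.3f}")
```

A time-varying variant would refit W over short windows of the utterance, and the paper's nonlinear models replace the linear map with a neural network; the evaluation by correlation between measured and estimated trajectories stays the same.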
