Estimation of speech acoustics from visual speech features: A comparison of linear and non-linear models

This paper examines the degree of correlation between lip and jaw configuration and speech acoustics. The lip and jaw positions are characterised by a system of measurements taken from video images of the speaker's face and profile, and the acoustics are represented using line spectral pair parameters and a measure of RMS energy. A correlation is found between the measured acoustic parameters and a linear estimate of the acoustics recovered from the visual data. This correlation exists despite the simplicity of the mapping and is in rough agreement with correlations measured in earlier work by Yehia et al. The linear estimates are also compared to estimates made using non-linear models. In particular, it is shown that although the performance of the two models is remarkably similar for static visual features, non-linear models are better able to handle dynamic features.
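The linear mapping described above can be sketched as an ordinary least-squares regression from a matrix of visual features to a matrix of acoustic parameters. The sketch below uses synthetic data and hypothetical dimensions (the variable names, feature counts, and noise level are illustrative assumptions, not the paper's actual setup), and measures the per-parameter correlation between the measured and the linearly estimated acoustics:

```python
import numpy as np

# Illustrative sketch, not the paper's implementation: estimate acoustic
# parameters (e.g. line spectral pairs plus RMS energy) from visual
# features (lip/jaw measurements) with a linear least-squares mapping.
rng = np.random.default_rng(0)

n_frames, n_visual, n_acoustic = 200, 6, 11  # hypothetical sizes
V = rng.normal(size=(n_frames, n_visual))          # visual features per frame
W_true = rng.normal(size=(n_visual, n_acoustic))   # synthetic ground-truth map
A = V @ W_true + 0.1 * rng.normal(size=(n_frames, n_acoustic))  # noisy acoustics

# Fit W minimising ||V W - A||^2, then form the linear acoustic estimate.
W, *_ = np.linalg.lstsq(V, A, rcond=None)
A_hat = V @ W

# Per-parameter correlation between measured and estimated acoustics.
corr = [np.corrcoef(A[:, j], A_hat[:, j])[0, 1] for j in range(n_acoustic)]
mean_corr = float(np.mean(corr))
print(f"mean correlation: {mean_corr:.3f}")
```

A non-linear alternative (e.g. a small neural network) would replace the `lstsq` fit with a learned non-linear map; the correlation metric stays the same, which is what makes the two model classes directly comparable.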