An algorithm for detecting a speaker's lip motion is presented. It processes a colour video sequence of the speaker's face, acquired under natural lighting conditions and without any particular make-up. It is intended for applications in speech recognition, videoconferencing, or the synthesis and animation of a speaker's face. The algorithm follows a statistical approach based on Markov Random Field (MRF) modelling, with a spatiotemporal neighbourhood of the pixels in the image sequence. Two kinds of observations are used: the temporal difference between successive images (motion information) and the purity of red hue in the current and past images (spatial information about lip location). The field of hidden labels relevant for lip motion detection is obtained by energy minimisation and proves to be robust to lighting conditions (shadows). This label field provides qualitative information (mouth opening and closing) as well as quantitative information, obtained by measuring geometrical features (horizontal and vertical lip spacing) directly on the label field.
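The labelling and measurement steps can be illustrated with a minimal sketch. Everything specific in it is an assumption made for illustration, not the formulation of the paper: the label field is reduced to two labels (1 = moving lip pixel, 0 = other), the two observations are fused by a simple product, the energy is a quadratic data term plus a Potts-style penalty over four spatial neighbours and one temporal neighbour, and the minimisation uses Iterated Conditional Modes (ICM), a simple deterministic scheme. The function names `icm_lip_labels` and `lip_spacing`, the weight `beta`, and the bounding-box measure of lip spacing are hypothetical.

```python
import numpy as np

def icm_lip_labels(motion, redness, prev_labels, beta=0.5, n_sweeps=5):
    """Binary label field by ICM over a spatiotemporal neighbourhood.

    motion, redness: float arrays in [0, 1] (temporal difference, red-hue purity).
    prev_labels: label field of the previous image (temporal neighbour).
    All modelling choices here are illustrative assumptions.
    """
    evidence = motion * redness                 # assumed fusion of the two observations
    labels = (evidence > 0.5).astype(np.int8)   # initial guess by thresholding
    h, w = labels.shape
    for _ in range(n_sweeps):
        changed = False
        for y in range(h):
            for x in range(w):
                # One temporal neighbour plus up to four spatial neighbours.
                neigh = [prev_labels[y, x]]
                if y > 0:
                    neigh.append(labels[y - 1, x])
                if y < h - 1:
                    neigh.append(labels[y + 1, x])
                if x > 0:
                    neigh.append(labels[y, x - 1])
                if x < w - 1:
                    neigh.append(labels[y, x + 1])
                # Keep the label with the lowest local energy.
                best, best_e = labels[y, x], float("inf")
                for l in (0, 1):
                    data = (evidence[y, x] - l) ** 2
                    smooth = beta * sum(int(n != l) for n in neigh)
                    if data + smooth < best_e:
                        best, best_e = l, data + smooth
                if best != labels[y, x]:
                    labels[y, x] = best
                    changed = True
        if not changed:   # a full sweep without changes: local minimum reached
            break
    return labels

def lip_spacing(labels):
    """Horizontal and vertical extent (in pixels) of the labelled region."""
    ys, xs = np.nonzero(labels)
    if ys.size == 0:
        return 0, 0
    return int(xs.max() - xs.min() + 1), int(ys.max() - ys.min() + 1)
```

In this sketch the neighbourhood penalty removes isolated pixels whose evidence is only marginally above the threshold, while compact regions with strong evidence keep their label. ICM only reaches a local minimum of the energy, so the result depends on the initial thresholding.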