Real-time lip trackers for use in audio-visual speech recognition

Human speech is inherently multi-modal, consisting of both audio and visual components. The increased computing power of general-purpose workstations and PCs has made it possible to extract visual features in real time that can be used to supplement acoustic-only speech recognisers, enabling robust recognition of speech in the presence of acoustic noise. To achieve real-time performance, previous work has relied on a dynamic contour framework with cosmetically assisted lips. It is shown here that unadorned lips can be tracked in real time without cosmetic assistance. In addition, a coupled head-lip tracker is presented which provides accurate, stable lip tracking across a range of head positions and poses. As well as improving tracking performance, coupling the head tracker to the lip tracker enables the extraction of visual recognition features that are invariant to head position, a prerequisite for audio-visual recognition of unconstrained speakers.
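
The abstract does not specify how head-position-invariant features are obtained, but the idea can be illustrated with a minimal sketch: map the tracked lip contour points from image coordinates into the head's own coordinate frame before computing shape features. The sketch below is a simplified 2-D case assuming the head tracker supplies a centre, in-plane rotation, and scale; the function names and the width/height features are illustrative assumptions, not the authors' method, and NumPy is assumed.

```python
import numpy as np

def head_frame_lip_features(lip_points, head_centre, head_angle, head_scale):
    """Map lip contour points from image coordinates into the head's
    coordinate frame, yielding shape features invariant to head
    translation, in-plane rotation, and scale (a simplified stand-in
    for full pose normalisation)."""
    c, s = np.cos(-head_angle), np.sin(-head_angle)
    R = np.array([[c, -s], [s, c]])               # undo the head's rotation
    normalised = (lip_points - head_centre) @ R.T / head_scale
    # Simple shape features: mouth width and height in the head frame
    width = normalised[:, 0].max() - normalised[:, 0].min()
    height = normalised[:, 1].max() - normalised[:, 1].min()
    return np.array([width, height])

# Example: the same mouth shape observed at two head positions/poses
lips = np.array([[-1.0, 0.0], [0.0, 0.4], [1.0, 0.0], [0.0, -0.4]])

def place(points, centre, angle, scale):
    """Synthesise an image-plane observation of the lips under a head pose."""
    c, s = np.cos(angle), np.sin(angle)
    R = np.array([[c, -s], [s, c]])
    return points @ R.T * scale + centre

a = place(lips, np.array([120.0, 80.0]), 0.0, 30.0)
b = place(lips, np.array([200.0, 150.0]), 0.3, 45.0)

fa = head_frame_lip_features(a, np.array([120.0, 80.0]), 0.0, 30.0)
fb = head_frame_lip_features(b, np.array([200.0, 150.0]), 0.3, 45.0)
print(np.allclose(fa, fb))  # True: features unchanged under head motion
```

Under these assumptions, the extracted features depend only on mouth shape, so a recogniser trained on them is unaffected by where the speaker's head sits in the frame.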