The challenge of multispeaker lip-reading

In speech recognition, the problem of speaker variability has been well studied. Common approaches to dealing with it include normalising for a speaker's vocal tract length and learning a linear transform that moves the speaker-independent models closer to a new speaker. In pure lip-reading (no audio), the problem has been studied far less. Results are often reported on speaker-dependent (single-speaker) or multi-speaker (speakers in the test set also appear in the training set) data, configurations that are of limited use in real applications. This paper shows the danger of not separating speakers between the training and test sets. First, we present classification results on a new single-word database, AVletters 2, a high-definition version of the well-known AVletters database. By careful choice of features, we show that visual-only lip-reading performance can come very close to that of audio-only recognition in the single-speaker and multi-speaker configurations. In the speaker-independent configuration, however, the performance of the visual-only channel degrades dramatically. By applying multidimensional scaling (MDS) to both the audio and visual features, we demonstrate that lip-reading visual features, compared with the MFCCs commonly used for audio speech recognition, have inherently small variation within a single speaker across all spoken classes. However, visual features are highly sensitive to the identity of the speaker, whereas audio features are relatively invariant to it.
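The MDS analysis described above can be illustrated with a minimal sketch. The code below is not the authors' implementation: it uses scikit-learn's metric MDS and synthetic feature vectors as stand-ins for real MFCC (audio) or visual features, with the per-speaker offsets and noise scales chosen purely for illustration. It embeds per-utterance features in 2-D and compares within-speaker scatter to overall scatter; speaker-dominated features (as the paper reports for visual features) show small within-speaker spread relative to the spread across speakers.

```python
# A minimal sketch, assuming synthetic stand-in features and
# scikit-learn's metric MDS (the paper does not specify an implementation).
import numpy as np
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
n_speakers, n_utterances, n_dims = 3, 40, 20

# Hypothetical "visual-like" features: a large per-speaker offset
# (speaker identity dominates) plus small within-speaker variation
# across the spoken classes.
speaker_means = rng.normal(scale=5.0, size=(n_speakers, n_dims))
features = np.vstack([
    mean + rng.normal(scale=0.5, size=(n_utterances, n_dims))
    for mean in speaker_means
])
labels = np.repeat(np.arange(n_speakers), n_utterances)

# Metric MDS on Euclidean distances between feature vectors.
embedding = MDS(n_components=2, dissimilarity="euclidean",
                random_state=0).fit_transform(features)

# Compare within-speaker spread with the overall spread of the embedding:
# a small ratio indicates features dominated by speaker identity.
within = np.mean([embedding[labels == s].std(axis=0).mean()
                  for s in range(n_speakers)])
overall = embedding.std(axis=0).mean()
print(f"within-speaker spread: {within:.2f}")
print(f"overall spread:        {overall:.2f}")
```

Rerunning the sketch with a small per-speaker offset and large within-speaker noise models the "audio-like" case, where the two spreads are comparable and the embedding does not separate speakers.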
