A comparison of model and transform-based visual features for audio-visual LVCSR

Four different visual speech parameterisation methods are compared on a large vocabulary, continuous, audio-visual speech recognition task using the IBM ViaVoice™ audio-visual speech database. Three are transforms applied directly to the mouth image region: the discrete cosine transform, the discrete wavelet transform, and principal component analysis. The fourth uses a statistical model of shape and appearance, an active appearance model, to track the entire face and obtain model parameters describing it. All parameterisations are compared experimentally using hidden Markov models (HMMs) in a speaker-independent test. Visual-only HMMs are used to rescore lattices obtained from audio models trained in noisy conditions.
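As a minimal sketch of one of the transform-based parameterisations (the DCT), the snippet below computes a 2-D DCT of a pre-extracted grayscale mouth region and retains the lowest-frequency coefficients as the visual feature vector. The ROI size, coefficient count, and zig-zag selection are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np
from scipy.fftpack import dct

def dct_mouth_features(mouth_roi: np.ndarray, n_coeffs: int = 24) -> np.ndarray:
    """Compute a 2-D DCT of a grayscale mouth region and keep the
    lowest-frequency coefficients (zig-zag order) as the feature vector."""
    # Separable 2-D type-II DCT with orthonormal scaling.
    coeffs = dct(dct(mouth_roi.astype(float), axis=0, norm="ortho"),
                 axis=1, norm="ortho")
    # Zig-zag ordering: sort coefficient positions by row+column index so
    # low spatial frequencies come first.
    h, w = coeffs.shape
    order = sorted(((r, c) for r in range(h) for c in range(w)),
                   key=lambda rc: (rc[0] + rc[1], rc[0]))
    return np.array([coeffs[r, c] for r, c in order[:n_coeffs]])

# Example: a 32x32 mouth image yields a 24-dimensional visual feature vector,
# which would then be modelled by the visual-only HMMs.
roi = np.random.rand(32, 32)
print(dct_mouth_features(roi).shape)  # (24,)
```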
