论文信息 - A cascade image transform for speaker independent automatic speechreading

A cascade image transform for speaker independent automatic speechreading

We propose a three-stage pixel based visual front end for automatic speechreading (lipreading) that results in improved recognition performance of spoken words or phonemes. The proposed algorithm is a cascade of three transforms applied to a three-dimensional video region of interest that contains the speaker's mouth area. The first stage is a typical image compression transform that achieves a high "energy", reduced-dimensionality representation of the video data. The second stage is a linear discriminant analysis based data projection, which is applied to a concatenation of a small number of consecutive image transformed video data. The third stage is a data rotation by means of a maximum likelihood linear transform. Such a transform optimizes the likelihood of the observed data under the assumption of their class conditional Gaussian distribution with diagonal covariance. We apply the algorithm to visual-only 52-class phonetic and 27-class visemic classification on a 162-subject, 7-hour long, large vocabulary, continuous speech audio-visual dataset. We demonstrate significant classification accuracy gains by each added stage of the proposed algorithm, which, when combined, can reach up to 27% improvement. Overall, we achieve a 49% (38%) visual-only frame level phonetic classification accuracy with (without) use of test set phone boundaries. In addition, we report improved audio-visual phonetic classification over the use of a single-stage image transform visual front end.

[1] Ramesh A. Gopinath,et al. Maximum likelihood modeling with Gaussian distributions for classification , 1998, Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP '98 (Cat. No.98CH36181).

[2] Javier R. Movellan,et al. Dynamic Features for Visual Speechreading: A Systematic Comparison , 1996, NIPS.

[3] Gerasimos Potamianos,et al. Discriminative training of HMM stream exponents for audio-visual speech recognition , 1998, Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP '98 (Cat. No.98CH36181).

[4] Anil K. Jain,et al. Statistical Pattern Recognition: A Review , 2000, IEEE Trans. Pattern Anal. Mach. Intell..

[5] David G. Stork,et al. Visionary Speech: Looking Ahead to Practical Speechreading Systems , 1996 .

[6] Yochai Konig,et al. "Eigenlips" for robust speech recognition , 1994, Proceedings of ICASSP '94. IEEE International Conference on Acoustics, Speech and Signal Processing.

[7] N. Michael Brooke. Talking Heads and Speech Recognisers That Can See: The Computer Processing of Visual Speech Signals , 1996 .

[8] F. A. Seiler,et al. Numerical Recipes in C: The Art of Scientific Computing , 1989 .

[9] Juergen Luettin,et al. Towards speaker independent continuous speechreading , 1997, EUROSPEECH.

[10] Andrew W. Senior,et al. Face and Feature Finding for a Face Recognition System , 1999 .

[11] Gene H. Golub,et al. Matrix computations , 1983 .

[12] Chalapathy Neti,et al. Audio-visual large vocabulary continuous speech recognition in the broadcast domain , 1999, 1999 IEEE Third Workshop on Multimedia Signal Processing (Cat. No.99TH8451).

[13] Biing-Hwang Juang,et al. Fundamentals of speech recognition , 1993, Prentice Hall signal processing series.

[14] Stephen J. Cox,et al. Audiovisual speech recognition using multiscale nonlinear image decomposition , 1996, Proceeding of Fourth International Conference on Spoken Language Processing. ICSLP '96.

[15] Gerasimos Potamianos,et al. Linear discriminant analysis for speechreading , 1998, 1998 IEEE Second Workshop on Multimedia Signal Processing (Cat. No.98EX175).

[16] Gerasimos Potamianos,et al. An image transform approach for HMM based automatic lipreading , 1998, Proceedings 1998 International Conference on Image Processing. ICIP98 (Cat. No.98CB36269).

[17] William H. Press,et al. Numerical Recipes in FORTRAN - The Art of Scientific Computing, 2nd Edition , 1987 .

[18] Alexander H. Waibel,et al. Toward movement-invariant automatic lip-reading and speech recognition , 1995, 1995 International Conference on Acoustics, Speech, and Signal Processing.

[19] Jenq-Neng Hwang,et al. Lipreading from color video , 1997, IEEE Trans. Image Process..