Paper information - Audio-visual classification of Swedish phonemes for pronunciation training

We present a method for audio-visual classification of Swedish phonemes, to be used in computer-assisted pronunciation training. The probabilistic kernel-based method is applied to the audio signal, to a principal or independent component (PCA or ICA) representation of the mouth region in video images, or to both. We investigate which representation (PCA or ICA) is most suitable, and how many components the basis requires, in order to automatically detect pronunciation errors in Swedish from audio-visual input. Experiments performed on one speaker show that the visual information helps avoid classification errors that would lead to gravely erroneous feedback to the user; that it is better to perform phoneme classification on audio and video separately and then fuse the results, rather than combining them before classification; and that PCA outperforms ICA when few components are used.

Index Terms: audio-visual phoneme classification, pronunciation error detection, PCA, ICA
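The abstract reports that classifying audio and video separately and then fusing the results works better than combining the inputs before classification. The sketch below illustrates one possible form of such late fusion, under assumptions that are not taken from the paper: the fusion rule is a weighted log-linear combination of per-phoneme posteriors, and the function names (`late_fusion`, `classify`), the weight `audio_weight`, the phoneme labels, and all probabilities are hypothetical values chosen for illustration only.

```python
import math


def late_fusion(audio_post, video_post, audio_weight=0.5):
    """Fuse two per-phoneme posterior dicts with a weighted log-linear rule.

    Assumption: both dicts share the same phoneme keys. This rule is a
    generic choice for illustration, not the method used in the paper.
    """
    eps = 1e-12  # guards against log(0)
    scores = {}
    for phoneme in audio_post:
        scores[phoneme] = (
            audio_weight * math.log(audio_post[phoneme] + eps)
            + (1.0 - audio_weight) * math.log(video_post[phoneme] + eps)
        )
    # Convert log scores back to normalised probabilities.
    top = max(scores.values())
    unnormalised = {p: math.exp(s - top) for p, s in scores.items()}
    total = sum(unnormalised.values())
    return {p: v / total for p, v in unnormalised.items()}


def classify(posteriors):
    """Return the phoneme with the highest posterior."""
    return max(posteriors, key=posteriors.get)


# Hypothetical posteriors: the audio classifier slightly prefers /n/,
# while the video classifier sees closed lips and favours /m/.
audio = {"m": 0.40, "n": 0.45, "l": 0.15}
video = {"m": 0.70, "n": 0.15, "l": 0.15}

fused = late_fusion(audio, video)
print("audio only:", classify(audio))
print("fused:", classify(fused))
```

In this made-up case the audio classifier alone picks /n/, while the fused decision is /m/, the kind of correction the visual channel is reported to provide. Early fusion would instead concatenate the audio and visual feature vectors and train a single classifier on the result.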

Sherif Abdou | Olov Engwall | Hedvig Kjellstr
