Paper information - Detection of specific mispronunciations using audiovisual features

Detection of specific mispronunciations using audiovisual features

This paper introduces a general approach for binary classification of audiovisual data. The intended application is mispronunciation detection for specific phonemic errors, using very sparse training data. The system uses a Support Vector Machine (SVM) classifier with features obtained from a Time Varying Discrete Cosine Transform (TV-DCT) on the audio log-spectrum as well as on the image sequences. The concatenated feature vectors from both the modalities were reduced to a very small subset using a combination of feature selection methods. We achieved 95-100% correct classification for each pair-wise classifier on a database of Swedish vowels with an average of 58 instances per vowel for training. The performance was largely unaffected when tested on data from a speaker who was not included in the training.
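
The abstract does not spell out how the TV-DCT is computed, so the following is only a rough sketch of one plausible reading: each frequency band's trajectory over time is summarised by its lowest-order DCT-II coefficients, which turns segments of different durations into feature vectors of the same length. The function names `dct2` and `tv_dct_features`, the choice of three coefficients per band, and the toy input are all assumptions made for illustration. The feature selection and SVM stages are left out.

```python
import math


def dct2(signal, num_coeffs):
    """Return the first `num_coeffs` DCT-II coefficients of a 1-D sequence.

    Coefficients are scaled by 1/N, so coefficient 0 is the mean of the signal.
    """
    n = len(signal)
    return [
        sum(x * math.cos(math.pi * k * (2 * t + 1) / (2 * n))
            for t, x in enumerate(signal)) / n
        for k in range(num_coeffs)
    ]


def tv_dct_features(frames, num_time_coeffs=3):
    """Map a (num_frames x num_bands) sequence to a fixed-length vector.

    Assumption: the DCT is taken along the time axis of each band, and only
    the lowest-order coefficients are kept. This is an illustrative guess,
    not the paper's confirmed definition of TV-DCT.
    """
    num_bands = len(frames[0])
    features = []
    for band in range(num_bands):
        trajectory = [frame[band] for frame in frames]
        features.extend(dct2(trajectory, num_time_coeffs))
    return features


# Hypothetical toy input: frames of a 2-band log-spectrum.
# Band 0 is constant over time, band 1 rises linearly.
short_segment = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]]
long_segment = short_segment + [[1.0, 5.0], [1.0, 6.0]]

short_vec = tv_dct_features(short_segment)  # 5 frames -> 6 features
long_vec = tv_dct_features(long_segment)    # 7 frames -> 6 features
```

Under these assumptions, both segments yield six features (two bands times three coefficients) even though they differ in length. A fixed-length representation of this kind is what allows a standard SVM to be applied to variable-duration phoneme segments.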

Sherif Abdou | Olov Engwall | Preben Wik | Gopal Ananthakrishnan | Sébastien Picard
