This paper presents STODE, a novel vision-based speech analysis system used for spoken Chinese training of oral deaf children. Its design goal is to help oral deaf children overcome two major difficulties in speech learning: confusion of intonations for spoken Chinese characters, and timing errors within words and characters. The system integrates real-time lip tracking and feature extraction, multi-state lip modeling, and a Time-Delay Neural Network (TDNN) for visual speech analysis. A desk-mounted camera tracks the user in real time. In each frame, the region of interest is identified and key information is extracted. The preprocessed acoustic and visual information is then fed into a modular TDNN and combined for visual speech analysis. Confusion of intonations for spoken Chinese characters can be readily identified, and timing errors within words and characters can be detected using a Dynamic Time Warping (DTW) algorithm. For visual feedback, we have created an artificial talking head, cloned directly from the user's own images, that shows both correct and incorrect ways of pronunciation. The system has been used successfully for spoken Chinese training of oral deaf children in cooperation with Nanjing Oral School, under grants from the National Natural Science Foundation of China.
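To illustrate how a DTW alignment can expose timing errors, the following is a minimal sketch and not the STODE implementation. It assumes one scalar feature per frame (for example, a lip-opening measure) and an absolute-difference local cost. The names `dtw` and `frames_per_reference` are hypothetical and chosen only for this example.

```python
import math


def dtw(reference, attempt):
    """Align two 1-D feature sequences with classic DTW.

    Returns (total_cost, path), where path is a list of
    (reference_index, attempt_index) pairs from start to end.
    Assumption: local cost is the absolute difference of scalar features.
    """
    n, m = len(reference), len(attempt)
    # cost[i][j] = best cumulative cost aligning the first i reference
    # frames with the first j attempt frames.
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(reference[i - 1] - attempt[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j - 1],  # both sequences advance
                cost[i - 1][j],      # reference advances, attempt frame repeats
                cost[i][j - 1],      # attempt advances, reference frame repeats
            )

    # Backtrack from the end, always stepping to the cheapest predecessor.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return cost[n][m], path


def frames_per_reference(path, reference_length):
    """Count how many attempt frames were matched to each reference frame.

    A count well above 1 suggests the learner held that segment too long.
    """
    counts = [0] * reference_length
    for ref_idx, _ in path:
        counts[ref_idx] += 1
    return counts


# Hypothetical data: the learner holds the second segment for three frames.
reference = [0, 1, 2, 3]
attempt = [0, 1, 1, 1, 2, 3]
total_cost, path = dtw(reference, attempt)
stretch = frames_per_reference(path, len(reference))
```

In this sketch the total cost measures how well the shapes match once timing is factored out, while the per-frame counts in `stretch` indicate where the attempt was stretched relative to the reference. A real system would use multi-dimensional acoustic and visual features and a suitable distance in place of the scalar difference.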