Visual speech analysis and synthesis with application to Mandarin speech training

This paper presents a novel vision-based speech analysis system, STODE, used for spoken Chinese training of oral deaf children. Its design goal is to help oral deaf children overcome two major difficulties in speech learning: confusion of the tones of spoken Chinese characters and timing errors within words and characters. The system integrates real-time lip tracking and feature extraction, multi-state lip modeling, and a Time-Delay Neural Network (TDNN) for visual speech analysis. A desk-mounted camera tracks the user in real time; at each frame, the region of interest is identified and key features are extracted. The preprocessed acoustic and visual information is then fed into a modular TDNN and combined for visual speech analysis. Tone confusions in spoken Chinese characters can be readily identified, and timing errors within words and characters can be detected with a Dynamic Time Warping (DTW) algorithm. For visual feedback, we have created an artificial talking head, cloned directly from the user's own images, which demonstrates both correct and incorrect pronunciations. The system has been successfully used for spoken Chinese training of oral deaf children in cooperation with the Nanjing Oral School, under grants from the National Natural Science Foundation of China.
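The abstract names a modular TDNN but does not detail its architecture. The sketch below illustrates only the core time-delay operation such a network stacks: each output frame is computed from a sliding window of input frames, i.e. a 1-D convolution over time. The `tdnn_layer` name, the sigmoid activation, and all dimensions are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def tdnn_layer(x, w, b):
    """One time-delay layer (hypothetical dimensions, not the paper's).

    Each output frame depends on a sliding window of `context` input
    frames, so the layer is a 1-D convolution over time.
      x: (T, d_in)              input feature sequence
      w: (context, d_in, d_out) shared time-delay weights
      b: (d_out,)               bias
    Returns (T - context + 1, d_out).
    """
    context, _, d_out = w.shape
    t_out = x.shape[0] - context + 1
    out = np.empty((t_out, d_out))
    for t in range(t_out):
        window = x[t:t + context]                        # (context, d_in)
        out[t] = np.tensordot(window, w, axes=([0, 1], [0, 1])) + b
    return 1.0 / (1.0 + np.exp(-out))                    # sigmoid, as in classic TDNNs

# Example: 40 frames of 12 visual lip features -> 8 hidden units,
# with a 3-frame time-delay window.
rng = np.random.default_rng(0)
x = rng.normal(size=(40, 12))
h = tdnn_layer(x, rng.normal(size=(3, 12, 8)) * 0.1, np.zeros(8))
print(h.shape)  # (38, 8)
```

Because the weights are shared across time, the layer tolerates small shifts in when a mouth shape occurs, which is the property that makes TDNNs a natural fit for frame-level visual speech features.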
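The DTW step is likewise only named. Below is a minimal sketch of how dynamic time warping could align a learner's visual-feature sequence against a reference utterance to localize timing errors; the frame-wise Euclidean distance and the `dtw_align` helper are assumptions, since the paper does not specify its DTW variant.

```python
import numpy as np

def dtw_align(learner, reference):
    """Align two feature sequences (frames x features) with classic DTW.

    Returns the accumulated alignment cost and the warping path; path
    segments that run vertically or horizontally mark frames where the
    learner lingers or rushes, i.e. candidate timing errors.
    """
    n, m = len(learner), len(reference)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(learner[i - 1] - reference[j - 1])
            cost[i, j] = d + min(cost[i - 1, j - 1],    # match
                                 cost[i - 1, j],        # learner lingers
                                 cost[i, j - 1])        # learner rushes
    # Backtrack from (n, m) to recover the warping path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, (i, j) = min((cost[i - 1, j - 1], (i - 1, j - 1)),
                        (cost[i - 1, j],     (i - 1, j)),
                        (cost[i, j - 1],     (i, j - 1)))
    return cost[n, m], path[::-1]

# Example: a learner utterance stretched in time relative to the reference.
ref = np.sin(np.linspace(0, 3, 30))[:, None]
stu = np.sin(np.linspace(0, 3, 45))[:, None]   # same trajectory, slower timing
total_cost, path = dtw_align(stu, ref)
print(round(total_cost, 3), len(path))
```

In a training setting, the warping path could be rendered back onto the talking-head playback so the child sees exactly which part of the word was spoken too slowly or too quickly.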