Development of text-to-audiovisual speech synthesis to support interactive language learning on a mobile device

We have developed a distributed text-to-audiovisual-speech synthesizer (TTAVS) to support interactivity in computer-aided pronunciation training (CAPT) on a mobile platform. The TTAVS generates audiovisual corrective feedback based on mispronunciations detected in the second-language learner's speech. Our approach encodes key visemes in SVG format, which are compressed with GZIP and transmitted to the client, where the browser performs real-time morphing to render the visual speech. We have also developed a TTAVS animation player that plays the audio and visual speech synchronously while providing play/pause/resume controls. Evaluation shows that this newly proposed approach, compared with our original approach of generating an Ogg video on the server side and streaming it to the client, achieves a 66% reduction in the average size of the output files transmitted from the server to the client and an 83% reduction in client waiting times, while preserving image quality.
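The pipeline of key-viseme interpolation plus GZIP-compressed SVG transmission can be sketched as follows. This is a minimal illustration, not the paper's implementation: the viseme control points (`VISEME_AH`, `VISEME_M`), the linear morphing function, and the frame count are all hypothetical stand-ins for the actual TTAVS viseme data and browser-side morphing.

```python
import gzip

# Two hypothetical key visemes, each a mouth outline given as a list of
# (x, y) control points. Real visemes would come from the TTAVS server.
VISEME_AH = [(10, 30), (50, 5), (90, 30), (50, 55)]   # open mouth
VISEME_M  = [(10, 30), (50, 25), (90, 30), (50, 35)]  # closed mouth

def morph(a, b, t):
    """Linearly interpolate between two visemes; t in [0, 1]."""
    return [(ax + t * (bx - ax), ay + t * (by - ay))
            for (ax, ay), (bx, by) in zip(a, b)]

def to_svg(points):
    """Render a viseme's control points as a closed SVG path."""
    d = "M" + " L".join(f"{x:.1f},{y:.1f}" for x, y in points) + " Z"
    return (f'<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 100 60">'
            f'<path d="{d}" fill="none" stroke="black"/></svg>')

# Generate intermediate frames for an open-to-closed mouth transition,
# then GZIP-compress the SVG payload before sending it to the client;
# the browser decompresses transparently under Content-Encoding: gzip.
frames = "".join(to_svg(morph(VISEME_AH, VISEME_M, i / 9)) for i in range(10))
payload = gzip.compress(frames.encode("utf-8"))
print(len(frames), len(payload))  # compressed payload is much smaller
```

In the actual system the morphing runs client-side in the browser, so only the compressed key visemes need to be transmitted rather than every intermediate frame; the sketch above performs the interpolation in Python merely to show the frame-generation and compression steps.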