Synthetic visual speech driven from auditory speech

We have developed two different methods for using auditory, telephone speech to drive the movements of a synthetic face. In the first method, Hidden Markov Models (HMMs) were trained on a phonetically transcribed telephone speech database. The output of the HMMs was then fed into a rule-based visual speech synthesizer as a string of phonemes together with time labels. In the second method, Artificial Neural Networks (ANNs) were trained on the same database to map acoustic parameters directly to facial control parameters; the target parameter trajectories for training were generated by feeding phoneme strings from the database into the rule-based visual speech synthesizer. The two methods were evaluated through audiovisual intelligibility tests with ten hearing-impaired persons and compared to “ideal” articulations (where no recognition was involved), to a natural face, and to the intelligibility of the audio alone. The HMM method performed considerably better than the audio-alone condition (54% vs. 34% keywords correct), but not as well as the “ideal” articulating artificial face (64%). The intelligibility for the ANN method was 34% keywords correct.
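
To make the acoustic-to-visual mapping of the second method concrete, the following Python sketch shows a frame-wise regression from acoustic parameters to facial control parameters. The data shapes, parameter names, and the scikit-learn MLPRegressor are illustrative stand-ins under assumed dimensions, not the network, features, or training data used in the study.

```python
# Minimal sketch of a direct acoustic-to-visual mapping (method 2).
# All dimensions and names below are hypothetical placeholders.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# Stand-in data: 1000 frames of 13 acoustic parameters (e.g. cepstral
# coefficients) and 6 facial control parameters per frame (e.g. jaw
# opening, lip rounding). In the study, the target trajectories would
# come from the rule-based visual speech synthesizer.
acoustic = rng.standard_normal((1000, 13))
facial_targets = rng.standard_normal((1000, 6))

# Train a small network to map each acoustic frame to a facial frame.
net = MLPRegressor(hidden_layer_sizes=(64,), max_iter=500, random_state=0)
net.fit(acoustic, facial_targets)

# At run time, each incoming acoustic frame yields one frame of facial
# control parameters that can drive the synthetic face.
facial_trajectory = net.predict(acoustic[:10])
print(facial_trajectory.shape)  # (10, 6)
```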
