Multimodal speech synthesis

Multimodal speech synthesis ("talking heads") combines the synthesis of speech from text ("text-to-speech", TTS) with the synthesis of a visual rendering of a face that is lip-synced to the generated audio ("visual TTS", VTTS). Talking heads are now practical because of steadily increasing computing power and falling hardware prices. This paper highlights recent technological breakthroughs in both modalities and exposes synergies between the audio and visual technology components. Finally, it summarizes test results that show the impact of multimodal speech synthesis in communications and e-commerce applications.
