Multimodal Speech Synthesis

Speech output generation in the SmartKom system is realized by a corpus-based unit selection strategy that preserves many properties of the human voice. When the system’s avatar “Smartakus” is present on the screen, the synthetic speech signal is temporally synchronized with Smartakus’s visible speech gestures and prosodically adjusted to his pointing gestures to enhance multimodal communication. The unit selection voice was formally evaluated and found to be very well accepted and reasonably intelligible in SmartKom-specific scenarios.
