Identification of contrast and its emphatic realization in HMM based speech synthesis

This paper proposes to identify contrast, in the form of contrastive word pairs, and to signal it prosodically with emphatic accents in a Text-to-Speech (TTS) application that uses a Hidden Markov Model (HMM) based speech synthesis system. We first describe a novel method for automatically detecting contrastive word pairs using textual features only, and report its performance on a corpus of spontaneous conversations in English. We then describe the set of features selected to train an HMM-based speech synthesis system with the aim of properly controlling prosodic prominence (including emphasis). Results from a large-scale perceptual test show that in the majority of cases listeners judge emphatic contrastive word pairs to be as acceptable as their non-emphatic counterparts, while emphasis on non-contrastive pairs is almost never acceptable.

Index Terms: prosody, contrast, HMM speech synthesis
