A trainable prosodic model: learning the contours implementing communicative functions within a superpositional model of intonation

This paper introduces a new model-constrained, data-driven method to generate prosody from metalinguistic information. We refer here to the general ability of intonation to demarcate speech units and to convey information about the propositional and interactional functions of these units within the discourse. We make two strong hypotheses: (1) these functions are directly implemented as prototypical prosodic contours that are coextensive with the unit(s) they apply to, and (2) the prosody of the message is obtained by superposing, that is, adding, all the contributing contours. We describe an analysis-by-synthesis scheme that identifies these prototypical contours and separates out their contributions to the prosodic contours of the training data. We show that such a trainable prosodic model generates faithful prosodic contours with very few prototypical movements.
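The superposition hypothesis can be illustrated with a minimal sketch. It is not the authors' implementation: the function labels, the contour shapes, the semitone offsets, and the per-syllable sampling are all assumptions made for illustration, and the analysis-by-synthesis step that would learn the contours from data is left out. Each hypothetical prototype is stretched over the unit it applies to, and the overlapping contributions are summed.

```python
from typing import Callable, Dict, List, Tuple

# A contour maps a normalized position t in [0, 1] within its unit
# to an F0 offset (assumed here to be in semitones).
Contour = Callable[[float], float]

# Hypothetical prototypes; in a trained model these shapes would be learned.
PROTOTYPES: Dict[str, Contour] = {
    # Linear fall from +2 to -2 over the unit.
    "declarative": lambda t: 2.0 - 4.0 * t,
    # Rise-fall with a +3 peak at the middle of the unit.
    "emphasis": lambda t: 3.0 * (1.0 - abs(2.0 * t - 1.0)),
}

# A unit is (function label, first syllable index, last syllable index).
Unit = Tuple[str, int, int]


def superpose(n_syllables: int, units: List[Unit]) -> List[float]:
    """Sum the contours of all units, one value per syllable."""
    total = [0.0] * n_syllables
    for label, start, end in units:
        span = end - start
        for i in range(start, end + 1):
            # Stretch the prototype so that it is coextensive with the unit.
            t = (i - start) / span if span > 0 else 0.5
            total[i] += PROTOTYPES[label](t)
    return total


if __name__ == "__main__":
    # A five-syllable utterance with an emphasized group on syllables 1 to 3.
    contour = superpose(5, [("declarative", 0, 4), ("emphasis", 1, 3)])
    print(contour)
```

In this toy setting the utterance-level fall and the local rise-fall remain separable because each is tied to the scope of its own unit; recovering such components from observed contours is the part that the analysis-by-synthesis scheme addresses.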
