Intonation Conversion from Neutral to Expressive Speech

Intonation is one of the most important factors in speech expressivity. This paper presents a method for converting F0 contours from neutral to expressive speech. F0 segments are represented at the syllable level by discrete cosine transform (DCT) coefficients. Multi-level dynamic features are added to model the temporal correlation between syllables and to constrain the F0 contour at the phrase level. Gaussian mixture models (GMMs) map the prosodic features of neutral speech to those of expressive speech, and the converted F0 contour is generated under the dynamic-feature constraints. An experimental evaluation on a database of acted emotional speech shows the effectiveness of the proposed F0 model and conversion method.
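
To make the syllable-level representation concrete, the following is a minimal sketch of how an F0 segment might be encoded as DCT coefficients and how simple dynamic (delta) features between neighbouring syllables could be appended. It is not the authors' implementation: the function names, the number of coefficients, and the one-syllable delta window are all assumptions chosen for illustration, and the multi-level phrase constraints and the GMM mapping are omitted.

```python
import numpy as np

def dct_basis(n_frames, n_coef):
    """Orthonormal DCT-II basis of shape (n_coef, n_frames)."""
    n = np.arange(n_frames)
    k = np.arange(n_coef)[:, None]
    basis = np.sqrt(2.0 / n_frames) * np.cos(np.pi * (n + 0.5) * k / n_frames)
    basis[0] *= np.sqrt(0.5)  # scale the constant row so all rows have unit norm
    return basis

def encode_syllable(f0, n_coef=5):
    """Represent one syllable's F0 segment by its first DCT coefficients.

    n_coef=5 is an illustrative assumption, not a value from the paper.
    """
    f0 = np.asarray(f0, dtype=float)
    n_coef = min(n_coef, len(f0))  # cannot use more coefficients than frames
    return dct_basis(len(f0), n_coef) @ f0

def decode_syllable(coefs, n_frames):
    """Rebuild a smooth F0 segment of n_frames from DCT coefficients."""
    coefs = np.asarray(coefs, dtype=float)
    return dct_basis(n_frames, len(coefs)).T @ coefs

def add_deltas(coefs_per_syllable):
    """Append first-order differences between neighbouring syllables.

    Assumption: a symmetric one-syllable window with edge padding.
    """
    c = np.asarray(coefs_per_syllable, dtype=float)
    padded = np.pad(c, ((1, 1), (0, 0)), mode="edge")
    delta = 0.5 * (padded[2:] - padded[:-2])
    return np.hstack([c, delta])

# Usage: a rising 8-frame F0 segment in Hz (made-up values).
segment = np.linspace(120.0, 180.0, 8)
coefs = encode_syllable(segment, n_coef=3)
smooth = decode_syllable(coefs, n_frames=8)
```

With few coefficients the reconstruction is a smoothed version of the segment; with as many coefficients as frames it is exact. The first coefficient is proportional to the mean F0 of the syllable, and the higher ones describe its slope and curvature.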
