Voice and emotional expression transformation based on statistics of vowel parameters in an emotional speech database

We propose a simple method for modifying emotional speech sounds. The method aims at a real-time implementation of an emotional expression transformation system based on STRAIGHT. We developed mapping functions for spectra, fundamental frequency (F0), and vowel duration from a statistical analysis of 1500 expressive speech sounds in an emotional speech database. The spectral mapping parameters are first extracted at the centers of vowels and then interpolated with bilinear functions. The spectral frequency warping functions are designed manually. The F0 and duration mapping functions simply transform the average values on a log-frequency scale and a linear time scale, respectively. We demonstrate that the spectral distortion remains sufficiently small when ‘Neutral’ speech sounds are transformed into expressive speech sounds (i.e. ‘Bright’, ‘Excited’, ‘Angry’, and ‘Raging’ speech sounds).
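As a rough sketch of how such average-based F0 and duration mappings could be realized (not the paper's actual implementation), the following Python fragment assumes that per-emotion average log-F0 values and average vowel durations have already been computed from the database statistics; all function names, signatures, and numeric values are illustrative.

```python
import numpy as np

def map_f0(f0_neutral, mean_logf0_neutral, mean_logf0_target):
    """Shift an F0 contour so that its average log-frequency moves from the
    'Neutral' average to the target emotion's average (log-frequency scale)."""
    f0_out = f0_neutral.copy()
    voiced = f0_neutral > 0                       # leave unvoiced frames (F0 = 0) untouched
    shift = mean_logf0_target - mean_logf0_neutral
    f0_out[voiced] = np.exp(np.log(f0_neutral[voiced]) + shift)
    return f0_out

def map_duration(duration_neutral, mean_dur_neutral, mean_dur_target):
    """Scale a vowel duration by the ratio of average durations (linear time scale)."""
    return duration_neutral * (mean_dur_target / mean_dur_neutral)

# Example: a neutral vowel around 120 Hz shifted toward a higher-pitched target
# average (the averages here are placeholders, not values from the database).
f0 = np.array([0.0, 118.0, 121.0, 124.0, 0.0])
f0_transformed = map_f0(f0, mean_logf0_neutral=np.log(120.0),
                        mean_logf0_target=np.log(180.0))
```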