Paper information - Duration-embedded bi-HMM for expressive voice conversion

This paper presents a duration-embedded Bi-HMM framework for expressive voice conversion. First, Ward’s minimum variance clustering method is used to cluster all the conversion units (sub-syllables) in order to reduce the number of conversion models as well as the size of the required training database. The duration-embedded Bi-HMM trained with the EM algorithm is built for each sub-syllable class to convert the neutral speech into emotional speech considering the duration information. Finally, the prosodic cues are included in the modification of the spectrum-converted speech. The STRAIGHT algorithm is adopted for high-quality speech analysis and synthesis. Target emotions including happiness, sadness and anger are used. Objective and perceptual evaluations were conducted to compare the performance of the proposed approach with previous methods. The results show that the proposed method exhibits encouraging potential in expressive voice conversion.
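The clustering step mentioned in the abstract can be illustrated with a minimal sketch of Ward's minimum variance criterion: at each step, merge the two clusters whose union gives the smallest increase in within-cluster sum of squares. The 2-D feature vectors, the function name `ward_cluster`, and the target cluster count are hypothetical and chosen only for illustration; the abstract does not state which features represent a sub-syllable or how many classes were used.

```python
def ward_cluster(points, k):
    """Agglomerative clustering using Ward's minimum variance criterion.

    points: list of equal-length numeric tuples (hypothetical unit features).
    k: number of clusters to stop at (an assumption for this sketch).
    Returns a list of clusters, each a sorted list of point indices.
    """
    # Each cluster is stored as (member indices, centroid).
    clusters = [([i], list(p)) for i, p in enumerate(points)]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                (ma, ca), (mb, cb) = clusters[a], clusters[b]
                na, nb = len(ma), len(mb)
                dist2 = sum((x - y) ** 2 for x, y in zip(ca, cb))
                # Increase in within-cluster sum of squares if a and b merge.
                cost = na * nb / (na + nb) * dist2
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        _, a, b = best
        (ma, ca), (mb, cb) = clusters[a], clusters[b]
        na, nb = len(ma), len(mb)
        # The merged centroid is the size-weighted mean of the two centroids.
        centroid = [(na * x + nb * y) / (na + nb) for x, y in zip(ca, cb)]
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)]
        clusters.append((ma + mb, centroid))
    return [sorted(members) for members, _ in clusters]


# Toy data: five hypothetical conversion units described by two features.
units = [(0.0, 0.1), (0.1, 0.0), (5.0, 5.1), (5.1, 5.0), (9.0, 0.0)]
groups = ward_cluster(units, 3)
print(sorted(groups))
```

Under this scheme, one conversion model would be trained per resulting class rather than per unit, which is how clustering reduces both the number of models and the amount of training data needed.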

Chung-Hsien Wu | Chi-Chun Hsia | Te-Hsien Liu

[1] Tomoki Toda, et al. High quality voice conversion based on Gaussian mixture model with dynamic frequency warping, 2001, INTERSPEECH.

[2] Hideki Kawahara, et al. Speech representation and transformation using adaptive interpolation of weighted spectrum: vocoder revisited, 1997, 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[3] Antonio Bonafonte, et al. Including dynamic and phonetic information in voice conversion systems, 2004, INTERSPEECH.

[4] Jia Liu, et al. Voice conversion with smoothed GMM and MAP adaptation, 2003, INTERSPEECH.

[5] Satoshi Nakamura, et al. Voice conversion through vector quantization, 1988, ICASSP-88, International Conference on Acoustics, Speech, and Signal Processing.

[6] Stephen E. Levinson, et al. Continuously variable duration hidden Markov models for automatic speech recognition, 1986.

[7] Tomoki Toda, et al. GMM-based voice conversion applied to emotional speech synthesis, 2003, INTERSPEECH.

[8] Chung-Hsien Wu, et al. Automatic generation of synthesis units and prosodic information for Chinese concatenative synthesis, 2001, Speech Commun.

[9] Eric Moulines, et al. Continuous probabilistic transform for voice conversion, 1998, IEEE Trans. Speech Audio Process.

[10] J. H. Ward. Hierarchical Grouping to Optimize an Objective Function, 1963.