High Quality Voice Conversion through Phoneme-Based Linear Mapping Functions with STRAIGHT for Mandarin

A novel voice conversion system using phoneme-based linear mapping functions on main vowel phonemes is proposed in this paper. Our voice conversion algorithm has the following three improvements. First, instead of using all the vocal tract resonance (VTR) vectors in the portion of a phoneme, we use the VTR vector at the steady-state of each phoneme to train phoneme-based GMM. Second, different linear mapping functions have been trained to describe the mapping relationships for corresponding phonemes. Third, in the transformation procedure, the transformed formant frequencies at the main vowel phonemes are obtained using the corresponding GMM. Besides, prosody parameters are also transformed. Finally the converted speech is re-synthesized with the transformed parameters by high quality speech manipulation framework STRAIGHT (Speech Transformation and Representation based on Adaptive Interpolation of weiGHTed spectrogram). Perceptual results for F-M and M-F conversion show that our MOS score of the converted voice is improved from 3.8 to 4.1 and ABX score from 3.3 to 3.8 compared with IBM's system. Comparisons with other systems are also given in this paper.

[1]  H. Ney,et al.  VTLN-based cross-language voice conversion , 2003, 2003 IEEE Workshop on Automatic Speech Recognition and Understanding (IEEE Cat. No.03EX721).

[2]  Carlo Drioli,et al.  Sound Morphing With Gaussian Mixture Models , 2001 .

[3]  Hideki Kawahara,et al.  Speech representation and transformation using adaptive interpolation of weighted spectrum: vocoder revisited , 1997, 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[4]  Tomoki Toda,et al.  High quality voice conversion based on Gaussian mixture model with dynamic frequency warping , 2001, INTERSPEECH.

[5]  Eric Moulines,et al.  Continuous probabilistic transform for voice conversion , 1998, IEEE Trans. Speech Audio Process..

[6]  Levent M. Arslan,et al.  Speaker Transformation Algorithm using Segmental Codebooks (STASC) , 1999, Speech Commun..

[7]  Yonghong Yan,et al.  Mandarin Accent Analysis Based on Formant Frequencies , 2007, 2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07.

[8]  Levent M. Arslan,et al.  Voice conversion by codebook mapping of line spectral frequencies and excitation spectrum , 1997, EUROSPEECH.

[9]  K. Shikano,et al.  Voice conversion algorithm based on Gaussian mixture model with dynamic frequency warping of STRAIGHT spectrum , 2001, 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.01CH37221).

[10]  Alexander Kain,et al.  Design and evaluation of a voice conversion algorithm based on spectral envelope mapping and residual prediction , 2001, 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.01CH37221).

[11]  Yannis Stylianou,et al.  A system for voice conversion based on probabilistic classification and a harmonic plus noise model , 1998, Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP '98 (Cat. No.98CH36181).

[12]  Yung-Hwan Oh,et al.  Hidden Markov model based voice conversion using dynamic characteristics of speaker , 1997, EUROSPEECH.

[13]  Satoshi Nakamura,et al.  Voice conversion through vector quantization , 1988, ICASSP-88., International Conference on Acoustics, Speech, and Signal Processing.

[14]  Hui Ye,et al.  High quality voice morphing , 2004, 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[15]  Alexander Kain,et al.  Spectral voice conversion for text-to-speech synthesis , 1998, Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP '98 (Cat. No.98CH36181).

[16]  Kazuya Takeda,et al.  Speaker conversion through non-linear frequency warping of straight spectrum , 1999, EUROSPEECH.

[17]  Eric Moulines,et al.  Voice transformation using PSOLA technique , 1991, Speech Commun..

[18]  Keiichi Tokuda,et al.  Voice characteristics conversion for HMM-based speech synthesis system , 1997, 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[19]  Chao Huang,et al.  Large vocabulary Mandarin speech recognition with different approaches in modeling tones , 2000, INTERSPEECH.

[20]  Bayya Yegnanarayana,et al.  Transformation of formants for voice conversion using artificial neural networks , 1995, Speech Commun..