One-to-Many and Many-to-One Voice Conversion Based on Eigenvoices

This paper describes two flexible frameworks of voice conversion (VC), i.e., one-to-many VC and many-to-one VC. One-to-many VC realizes the conversion from a user's voice as a source to arbitrary target speakers' ones and many-to-one VC realizes the conversion vice versa. We apply eigenvoice conversion (EVC) to both VC frameworks. Using multiple parallel data sets consisting of utterance-pairs of the user and multiple pre-stored speakers, an eigenvoice Gaussian mixture model (EV-GMM) is trained in advance. Unsupervised adaptation of the EV-GMM is available to construct the conversion model for arbitrary target speakers in one-to-many VC or arbitrary source speakers in many-to-one VC using only a small amount of their speech data. Results of various experimental evaluations demonstrate the effectiveness of the proposed VC frameworks.

[1]  Keiichi Tokuda,et al.  Spectral conversion based on maximum likelihood estimation considering global variance of converted parameter , 2005, Proceedings. (ICASSP '05). IEEE International Conference on Acoustics, Speech, and Signal Processing, 2005..

[2]  Eric Moulines,et al.  Continuous probabilistic transform for voice conversion , 1998, IEEE Trans. Speech Audio Process..

[3]  Yoshinori Sagisaka,et al.  Speech spectrum conversion based on speaker interpolation and multi-functional representation with weighting by radial basis function networks , 1995, Speech Commun..

[4]  Keiichi Tokuda,et al.  Eigenvoices for HMM-based speech synthesis , 2002, INTERSPEECH.

[5]  Roland Kuhn,et al.  Rapid speaker adaptation in eigenvoice space , 2000, IEEE Trans. Speech Audio Process..

[6]  John B. Shoven,et al.  I , Edinburgh Medical and Surgical Journal.

[7]  Alexander Kain,et al.  Spectral voice conversion for text-to-speech synthesis , 1998, Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP '98 (Cat. No.98CH36181).

[8]  Athanasios Mouchtaris,et al.  Non-parallel training for voice conversion by maximum likelihood constrained adaptation , 2004, 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[9]  Hideki Kawahara,et al.  Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds , 1999, Speech Commun..

[10]  C. Zheng,et al.  ; 0 ; , 1951 .

[11]  Tomoki Toda,et al.  Eigenvoice conversion based on Gaussian mixture model , 2006, INTERSPEECH.

[12]  Yoshinori Sagisaka,et al.  Acoustic characteristics of speaker individuality: Control and conversion , 1995, Speech Commun..