Speaker adaptive training for one-to-many eigenvoice conversion based on Gaussian mixture model

One-to-many eigenvoice conversion (EVC) allows the conversion of a specific source speaker into arbitrary target speakers. Eigenvoice Gaussian mixture model (EV-GMM) is trained in advance with multiple parallel data sets consisting of the source speaker and many pre-stored target speakers. The EV-GMM is adapted for arbitrary target speakers using only a few utterances by estimating a small number of free parameters. Therefore, the initial EV-GMM directly affects the conversion performance of the adapted EV-GMM. In order to prepare a better initial model, this paper proposes Speaker Adaptive Training (SAT) of a canonical EV-GMM in one-to-many EVC. Results of objective and subjective evaluations demonstrate that SAT causes significant improvements in the performance of EVC. Index Terms: Speech synthesis, Voice conversion, Eigenvoice, Gaussian mixture model, Speaker adaptive training

[1]  Tomoki Toda,et al.  One-to-Many and Many-to-One Voice Conversion Based on Eigenvoices , 2007, 2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07.

[2]  Satoshi Nakamura,et al.  Voice conversion through vector quantization , 1988, ICASSP-88., International Conference on Acoustics, Speech, and Signal Processing.

[3]  Keiichi Tokuda,et al.  Spectral conversion based on maximum likelihood estimation considering global variance of converted parameter , 2005, Proceedings. (ICASSP '05). IEEE International Conference on Acoustics, Speech, and Signal Processing, 2005..

[4]  Tomoki Toda,et al.  Maximum likelihood voice conversion based on GMM with STRAIGHT mixed excitation , 2006, INTERSPEECH.

[5]  Kiyohiro Shikano,et al.  Cross-language Voice Conversion Evaluation Using Bilingual Databases , 2002 .

[6]  Mark J. F. Gales Cluster adaptive training of hidden Markov models , 2000, IEEE Trans. Speech Audio Process..

[7]  Eric Moulines,et al.  Continuous probabilistic transform for voice conversion , 1998, IEEE Trans. Speech Audio Process..

[8]  K. Shikano,et al.  Voice conversion through vector quantization , 1988, ICASSP-88., International Conference on Acoustics, Speech, and Signal Processing.

[9]  Tomoki Toda,et al.  Eigenvoice conversion based on Gaussian mixture model , 2006, INTERSPEECH.

[10]  Richard M. Schwartz,et al.  A compact model for speaker-adaptive training , 1996, Proceeding of Fourth International Conference on Spoken Language Processing. ICSLP '96.

[11]  Roland Kuhn,et al.  Rapid speaker adaptation in eigenvoice space , 2000, IEEE Trans. Speech Audio Process..

[12]  Hideki Kawahara,et al.  Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds , 1999, Speech Commun..

[13]  K. Shikano,et al.  Statistical analysis of bilingual speaker's speech for cross-language voice conversion. , 1991, The Journal of the Acoustical Society of America.