MAP-based adaptation for speech conversion using adaptation data selection and non-parallel training

This study presents an approach to GMM-based speech conversion using maximum a posteriori (MAP) adaptation. First, a conversion function is trained on a parallel corpus containing the same utterances spoken by both the source and the reference speakers. A non-parallel corpus from a new target speaker is then used to adapt the conversion function so that it models the voice conversion between the source speaker and the new target speaker. The consistency among the adaptation data is estimated in order to select suitable data from the non-parallel corpus for MAP-based adaptation of the GMMs. In the speech conversion evaluation, experimental results show that MAP adaptation using a small non-parallel corpus reduces the conversion error and improves the speech quality for speaker identification compared with the method without adaptation. Objective and subjective tests also confirm the promising performance of the proposed approach.
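
As a rough illustration of the adaptation step, the sketch below applies the widely used relevance-factor form of MAP mean adaptation to a diagonal-covariance GMM. The abstract does not give the paper's exact update rule or its data selection criterion, so the function name `map_adapt_means`, the `relevance` parameter, and the choice to adapt only the means are assumptions made for this example, not the authors' method.

```python
import numpy as np

def map_adapt_means(means, variances, weights, data, relevance=16.0):
    """MAP-adapt the means of a diagonal-covariance GMM.

    Assumed (hypothetical) setup: weights and variances stay fixed and only
    the means move toward the adaptation data.
    Shapes: means, variances (K, D); weights (K,); data (T, D).
    """
    # Log-likelihood of every frame under every component: shape (T, K).
    diff = data[:, None, :] - means[None, :, :]
    log_gauss = -0.5 * (
        np.sum(diff ** 2 / variances, axis=2)
        + np.sum(np.log(2.0 * np.pi * variances), axis=1)
    )

    # Posterior probability of each component given each frame.
    log_post = np.log(weights) + log_gauss
    log_post -= log_post.max(axis=1, keepdims=True)  # numerical stability
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)

    # Sufficient statistics: soft counts and posterior-weighted data means.
    n_k = post.sum(axis=0)                                   # (K,)
    data_mean = (post.T @ data) / np.maximum(n_k, 1e-10)[:, None]

    # Interpolate between the prior mean and the data mean. Components that
    # see little adaptation data stay close to their prior.
    alpha = (n_k / (n_k + relevance))[:, None]
    return alpha * data_mean + (1.0 - alpha) * means


# Toy usage: two components, adaptation frames clustered near the first one.
prior_means = np.array([[0.0, 0.0], [10.0, 10.0]])
prior_vars = np.ones((2, 2))
prior_weights = np.array([0.5, 0.5])
frames = np.ones((50, 2))
adapted = map_adapt_means(prior_means, prior_vars, prior_weights, frames)
```

In this toy case the first mean moves most of the way toward the adaptation frames, while the second, which receives almost no posterior mass, stays near its prior. A larger `relevance` value makes the update more conservative, which matters when the adaptation corpus is small.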