On combining statistical methods and frequency warping for high-quality voice conversion

In current voice conversion systems, achieving high similarity between converted and target voices requires a high degree of signal manipulation, which entails significant quality degradation, to the point that in some cases the quality scores are unacceptable for real-life applications. Indeed, a tradeoff can be observed between the similarity scores and the quality scores achieved by a given voice conversion system. In our previous work we showed that statistical methods and frequency warping transformations can be combined to yield a better similarity-quality balance than conventional systems, owing to significant quality improvements. In this paper, two different ways of combining these two approaches are compared through perceptual tests in order to determine the best strategy for high-quality voice conversion. The comparison is made under the same training conditions, using the same speech model and vector dimensions. The results indicate that listeners prefer the Weighted Frequency Warping method.
