Voice conversion with linear prediction residual estimation

This work compares a voice conversion system that converts only the source speaker's vocal tract representation with an augmented system that additionally estimates the target excitation signal. The estimation algorithm uses a stochastic model relating the excitation signal to the vocal tract features. Both systems were evaluated with objective and subjective tests assessing the effectiveness of the perceived identity conversion and the overall quality of the synthesized speech, for male-to-male and female-to-female conversion. The main objective of this work is to improve the recognizability of the converted speech while maintaining high synthesis quality.
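The excitation (linear prediction residual) mentioned above is the signal left over after the vocal tract contribution is removed by inverse filtering with the frame's LPC polynomial. As a minimal sketch of that analysis step, assuming NumPy and the standard autocorrelation method (this illustrates generic LPC inverse filtering, not the paper's specific estimation algorithm):

```python
import numpy as np

def lpc_autocorr(frame, order):
    """LPC coefficients via the autocorrelation method (Levinson-Durbin).
    Returns a with a[0] = 1, so the analysis filter is A(z) = sum_j a[j] z^-j."""
    n = len(frame)
    # One-sided autocorrelation r[0..order]
    r = np.correlate(frame, frame, mode="full")[n - 1:n + order]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        # Reflection coefficient for this recursion step
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        err *= (1.0 - k * k)
    return a

def lp_residual(frame, order):
    """Inverse-filter the frame with its own LPC polynomial to obtain
    the prediction residual, an estimate of the excitation signal."""
    a = lpc_autocorr(frame, order)
    # FIR filtering with A(z); truncate to the original frame length
    return np.convolve(frame, a)[:len(frame)]
```

For a strongly resonant signal, the residual carries much less energy than the frame itself, which is why voice conversion systems that discard it (and keep only vocal tract features) can still synthesize speech, at some cost to speaker identity.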
