A Voice Morphing Model Based on the Gaussian Mixture Model and Generative Topographic Mapping

In this paper, a new model for voice morphing is proposed. The spectral characteristics of a source speaker's speech are transformed so that the speech sounds as if it were spoken by a designated target speaker. The proposed model performs a phoneme segmentation of the voice signal and then transforms the spectral characteristics of each segment using a Linear Prediction model. The spectral features, extracted using the Linear Prediction Coding (LPC) technique, are aligned using Dynamic Time Warping (DTW). The Generative Topographic Mapping (GTM) method is used to model the LPC features, and the transformation is then performed using the Gaussian Mixture Model (GMM). Finally, the transformed codebooks are converted to prediction coefficients, and the excitation signal is filtered in order to synthesize the speech. A correlation test performed between the source and target signals showed a high correlation. The results indicate that the proposed model is promising for recognizing full sentences as well as individual words.
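
To make the alignment step concrete, the following is a minimal sketch of DTW between two sequences of feature frames (for example, per-frame LPC vectors). It is an illustration under stated assumptions, not the implementation used in this work: the function name `dtw_align`, the Euclidean frame distance, and the unconstrained step pattern (match, insertion, deletion) are all choices made here for the example.

```python
import math


def dtw_align(source, target):
    """Align two sequences of feature vectors with classic DTW.

    Returns (total_cost, path), where path is a list of (i, j) index
    pairs mapping source frame i to target frame j.
    Assumption: Euclidean distance between frames, no slope constraints.
    """
    n, m = len(source), len(target)
    # cost[i][j] = minimal accumulated cost of aligning the first i
    # source frames with the first j target frames.
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(source[i - 1], target[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # source frame repeated on target
                                 cost[i][j - 1],      # target frame repeated on source
                                 cost[i - 1][j - 1])  # one-to-one match

    # Backtrack from the end to recover the warping path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    path.reverse()
    return cost[n][m], path


# Hypothetical one-dimensional "features"; real LPC frames would have
# one entry per prediction coefficient.
src_frames = [[0.0], [1.0], [2.0]]
tgt_frames = [[0.0], [1.0], [1.0], [2.0]]
total_cost, warp_path = dtw_align(src_frames, tgt_frames)
print(total_cost, warp_path)
```

In a pipeline like the one described above, the returned path would pair each source frame with a target frame, so that the paired frames can serve as training data for the spectral mapping.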
