Voice transformation using PSOLA technique

Abstract: This contribution describes a new system for voice conversion. The proposed architecture combines a synthesizer derived from PSOLA (Pitch Synchronous Overlap and Add) with a module for spectral transformation. The synthesizer, based on the classical source-filter decomposition, allows prosodic and spectral transformations to be performed independently. Prosodic modifications are applied to the excitation signal using the TD-PSOLA scheme; the converted speech is then synthesized using the transformed spectral parameters. Two approaches to deriving spectral transformations, both borrowed from the speech-recognition domain, are compared: Linear Multivariate Regression (LMR) and Dynamic Frequency Warping (DFW). Vector quantization is carried out as a preliminary stage so that the spectral transformations depend on the acoustical realization of sounds. A formal listening test shows that the synthesizer produces a satisfyingly natural “transformed” voice. LMR proves to allow a slightly better conversion than DFW. Still, there is room for improvement in the spectral transformation stage.
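
To make the combination of vector quantization and LMR concrete, the following is a minimal sketch, not the authors' implementation. It assumes that source and target spectral frames have already been time-aligned (for example by dynamic time warping) and that a codebook has been trained beforehand; the abstract specifies neither step. The function names `fit_classwise_lmr` and `convert`, and the choice of an affine map with a bias term, are illustrative assumptions.

```python
import numpy as np

def nearest_codeword(frames, codebook):
    """Return, for each frame, the index of the closest codeword (Euclidean)."""
    dists = np.linalg.norm(frames[:, None, :] - codebook[None, :, :], axis=2)
    return dists.argmin(axis=1)

def fit_classwise_lmr(src, tgt, codebook):
    """Fit one affine least-squares map per VQ class (hypothetical helper).

    src, tgt: (n_frames, dim) arrays of time-aligned spectral vectors.
    codebook: (n_classes, dim) array of source-speaker codewords.
    """
    labels = nearest_codeword(src, codebook)
    # Append a constant column so that each map includes a bias term.
    src_aug = np.hstack([src, np.ones((len(src), 1))])
    maps = {}
    for k in range(len(codebook)):
        idx = labels == k
        if not idx.any():
            continue  # no training frames fell into this class
        W, *_ = np.linalg.lstsq(src_aug[idx], tgt[idx], rcond=None)
        maps[k] = W  # shape (dim + 1, dim)
    return maps

def convert(frames, codebook, maps):
    """Apply the class-dependent maps; frames in unseen classes pass through."""
    labels = nearest_codeword(frames, codebook)
    aug = np.hstack([frames, np.ones((len(frames), 1))])
    out = frames.astype(float).copy()
    for k, W in maps.items():
        idx = labels == k
        out[idx] = aug[idx] @ W
    return out
```

Making the regression class-dependent is what lets the transformation vary with the acoustical realization of each sound: a single global linear map would have to serve all sound classes at once.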
