Non-linear frequency scale mapping for voice conversion in text-to-speech system with cepstral description

Voice conversion, i.e. the modification of a speech signal so that it sounds as if spoken by a different speaker, is used in speech synthesis to produce a new voice without the need for a new database. This paper introduces two new, simple non-linear methods of frequency scale mapping for transforming voice characteristics between a male voice and a female or child voice. The methods were developed primarily for the Czech and Slovak text-to-speech (TTS) system designed for blind users and based on the Pocket PC device platform. The system uses a cepstral description of the diphone speech inventory of a male speaker, with either the source-filter speech model or the harmonic speech model. Three new diphone speech inventories, corresponding to female, child, and young male voices, are created from the original male speech inventory. Listening tests are used to evaluate the voice transformation and the quality of the synthetic speech.
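
The abstract does not state the two mapping functions, so the following is only a minimal sketch of the general idea of non-linear frequency scale mapping, not the paper's method. It assumes a hypothetical power-law mapping that keeps 0 Hz and the Nyquist frequency fixed and raises the frequencies in between, applied to a sampled spectral envelope. The names `warp_frequency`, `warp_envelope`, and `gamma`, the 16 kHz sampling rate, and the synthetic single-peak envelope are all assumptions made for illustration.

```python
import numpy as np

def warp_frequency(f, f_nyq, gamma):
    """Hypothetical power-law mapping (an assumption, not the paper's formula).

    Keeps 0 Hz and f_nyq fixed; for gamma < 1 every frequency in between
    is mapped to a higher one, as needed when moving formants upward.
    """
    return f_nyq * (np.asarray(f, dtype=float) / f_nyq) ** gamma

def warp_envelope(envelope, f_nyq, gamma):
    """Resample a spectral envelope along the warped frequency axis."""
    freqs = np.linspace(0.0, f_nyq, len(envelope))
    # For each output frequency, find the source frequency that the
    # mapping sends onto it (inverse of warp_frequency).
    src = f_nyq * (freqs / f_nyq) ** (1.0 / gamma)
    # Read the original envelope at those source frequencies.
    return np.interp(src, freqs, envelope)

# Synthetic envelope with a single formant-like peak at 1 kHz
# (assumed 16 kHz sampling rate, so f_nyq = 8 kHz).
f_nyq = 8000.0
freqs = np.linspace(0.0, f_nyq, 513)
envelope = np.exp(-0.5 * ((freqs - 1000.0) / 100.0) ** 2)

warped = warp_envelope(envelope, f_nyq, gamma=0.9)
peak_in = freqs[np.argmax(envelope)]
peak_out = freqs[np.argmax(warped)]
print(f"peak moved from {peak_in:.0f} Hz to {peak_out:.0f} Hz")
```

With `gamma=0.9` the peak moves upward from 1 kHz to roughly 1.23 kHz while both ends of the frequency axis stay in place. In a cepstral system, the warped envelope would then have to be converted back to cepstral coefficients, a step this sketch leaves out.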
