Voice conversion for arbitrary speakers using articulatory-movement to vocal-tract parameter mapping

In this paper, we propose a voice conversion method based on articulatory-movement (AM) to vocal-tract parameter (VTP) mapping. An artificial neural network (ANN) is trained to map AM to VTP and thereby convert the source speaker's voice into the target speaker's voice. The proposed system is not only text-independent but also works for an arbitrary source speaker: no source speaker data are needed to build the conversion model, so the source speaker's speech is required only during the testing phase. Preliminary cross-lingual voice conversion experiments are also conducted. The conversion results were evaluated with subjective and objective measures to compare the proposed ANN-based voice conversion (VC) with state-of-the-art Gaussian mixture model (GMM)-based VC. The experimental results show that the converted voice is intelligible and carries the speaker individuality of the target speaker.
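To make the core idea concrete, the following is a minimal sketch of an ANN regression from AM vectors to VTP vectors, in the spirit of the mapping described above. All dimensions, layer sizes, and the synthetic training data are illustrative assumptions, not the paper's actual configuration; a one-hidden-layer MLP trained with gradient descent on mean-squared error stands in for whatever network the authors used.

```python
import numpy as np

# Hypothetical sketch: learn a frame-wise mapping from articulatory-movement
# (AM) feature vectors to vocal-tract parameter (VTP) vectors with a small MLP.
# Feature dimensions and data below are invented for illustration only.

rng = np.random.default_rng(0)

AM_DIM, VTP_DIM, HIDDEN = 8, 12, 16   # assumed feature dimensions
N = 200                               # number of synthetic training frames

# Synthetic stand-in for paired (AM, VTP) frames: a fixed nonlinear map + noise.
X = rng.standard_normal((N, AM_DIM))
W_true = rng.standard_normal((AM_DIM, VTP_DIM))
Y = np.tanh(X @ W_true) + 0.01 * rng.standard_normal((N, VTP_DIM))

# One-hidden-layer MLP: VTP_hat = tanh(AM @ W1 + b1) @ W2 + b2
W1 = 0.1 * rng.standard_normal((AM_DIM, HIDDEN)); b1 = np.zeros(HIDDEN)
W2 = 0.1 * rng.standard_normal((HIDDEN, VTP_DIM)); b2 = np.zeros(VTP_DIM)

def forward(X):
    H = np.tanh(X @ W1 + b1)          # hidden activations
    return H, H @ W2 + b2             # hidden layer, predicted VTP frames

def mse(pred, Y):
    return float(np.mean((pred - Y) ** 2))

lr = 0.05
_, pred = forward(X)
loss_before = mse(pred, Y)

# Plain batch gradient descent on the MSE loss.
for _ in range(500):
    H, pred = forward(X)
    grad_out = 2.0 * (pred - Y) / N               # dLoss/dPred (up to a constant)
    gW2 = H.T @ grad_out; gb2 = grad_out.sum(0)
    grad_h = (grad_out @ W2.T) * (1.0 - H ** 2)   # backprop through tanh
    gW1 = X.T @ grad_h; gb1 = grad_h.sum(0)
    W1 -= lr * gW1; b1 -= lr * gb1
    W2 -= lr * gW2; b2 -= lr * gb2

_, pred = forward(X)
loss_after = mse(pred, Y)
```

At conversion time, the same trained network would map an unseen speaker's AM frames to VTP frames, which is what makes the approach independent of the source speaker's training data.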
