A Framework for Cross-Lingual Voice Conversion using Artificial Neural Networks

Voice Conversion (VC) is the task of transforming an utterance of a source speaker so that it is perceived as if spoken by a specified target speaker. A VC system typically requires a set of utterances recorded by both speakers (called parallel data), which is not always feasible. Further, in a cross-lingual voice conversion system, where the source and target speakers speak different languages, it is impossible to have a parallel set of utterances. Hence, it is important to design an algorithm whose training is independent of the source speaker. We propose a framework which captures speaker-specific characteristics and thus avoids the need for any training utterance from the source speaker. The proposed framework exploits the mapping abilities of Artificial Neural Networks (ANN) to estimate the conversion function. Experimental results show that the transformed speech is intelligible and has the characteristics of the target speaker.
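As a rough illustration of what source-independent training could look like, the sketch below fits a small feed-forward network using data from one speaker only. It is not the paper's implementation: the feature dimensions, the synthetic data, and the names `train_mapping` and `convert` are all assumptions made for this example. The inputs stand in for some speaker-normalized representation of the target speaker's speech, and the outputs for that speaker's spectral features.

```python
import numpy as np

def train_mapping(X, Y, hidden=16, epochs=2000, lr=0.02, seed=0):
    """Fit a one-hidden-layer tanh network Y ~ f(X) by full-batch gradient descent."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(0.0, 0.5, (X.shape[1], hidden))
    b1 = np.zeros(hidden)
    W2 = rng.normal(0.0, 0.5, (hidden, Y.shape[1]))
    b2 = np.zeros(Y.shape[1])
    n = X.shape[0]
    for _ in range(epochs):
        H = np.tanh(X @ W1 + b1)            # hidden activations
        P = H @ W2 + b2                     # predicted output features
        G = 2.0 * (P - Y) / n               # gradient of the squared error w.r.t. P
        dH = (G @ W2.T) * (1.0 - H ** 2)    # backpropagate through tanh
        W2 -= lr * (H.T @ G)
        b2 -= lr * G.sum(axis=0)
        W1 -= lr * (X.T @ dH)
        b1 -= lr * dH.sum(axis=0)
    return W1, b1, W2, b2

def convert(params, X):
    """Apply the trained mapping to new input frames."""
    W1, b1, W2, b2 = params
    return np.tanh(X @ W1 + b1) @ W2 + b2

# Synthetic stand-in data (hypothetical): 4-dim inputs, 3-dim outputs.
rng = np.random.default_rng(1)
A = rng.normal(0.0, 1.0, (4, 3))
X_train = rng.normal(0.0, 1.0, (300, 4))
Y_train = np.tanh(X_train @ A)              # pretend target-speaker features
X_test = rng.normal(0.0, 1.0, (100, 4))
Y_test = np.tanh(X_test @ A)

params = train_mapping(X_train, Y_train)
pred = convert(params, X_test)

# Compare against a trivial baseline that always predicts the training mean.
baseline_mse = float(np.mean((Y_test - Y_train.mean(axis=0)) ** 2))
trained_mse = float(np.mean((Y_test - pred) ** 2))
print(f"baseline MSE: {baseline_mse:.4f}, trained MSE: {trained_mse:.4f}")
```

In this sketch no frame from any second speaker enters training. At conversion time, frames from an arbitrary source speaker would be passed through `convert` after the same (assumed) normalization step.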
