Voice conversion using conditional restricted Boltzmann machine

In this paper, we propose a new method for voice conversion using a conditional restricted Boltzmann machine (CRBM). The joint distribution of source and target acoustic features is modeled by the RBM part of the model, while short-term temporal constraints are introduced by conditioning on contextual frames, namely the past and future frames of the source speaker. In contrast to conventional methods, the temporal structure of the data can thus be modeled without resorting to dynamic features. Objective and subjective experiments were conducted to evaluate the method. The results show that the CRBM captures short-term temporal structure well and that the proposed method significantly outperforms the conventional joint-density Gaussian mixture model based method.

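To make the modeling idea concrete, the following is a minimal sketch of contrastive-divergence (CD-1) training for a Gaussian-Bernoulli conditional RBM, assuming the visible vector concatenates a paired source/target spectral frame and the conditioning input stacks past and future source frames. The unit-variance assumption, array shapes, learning rate, and the helper `sample_bernoulli` are illustrative assumptions following the standard CRBM formulation, not necessarily the paper's exact parameterization.

```python
# Sketch: CD-1 training step for a Gaussian-Bernoulli conditional RBM
# (unit-variance visible units). Visibles = paired source+target frame;
# conditioning input = past/future source frames (context).
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_bernoulli(p):
    # Sample binary hidden states from their activation probabilities.
    return (rng.random(p.shape) < p).astype(p.dtype)

def cd1_step(v0, u, W, A, B, a, b, lr=1e-3):
    """One contrastive-divergence (CD-1) parameter update.

    v0 : (batch, n_vis)  paired source+target frames (visible units)
    u  : (batch, n_ctx)  context frames (conditioning input)
    W  : (n_vis, n_hid)  visible-hidden weights
    A  : (n_ctx, n_vis)  context-to-visible (dynamic bias) weights
    B  : (n_ctx, n_hid)  context-to-hidden (dynamic bias) weights
    a, b : static visible / hidden biases
    """
    # Positive phase: hidden probabilities given data and context.
    h0_prob = sigmoid(v0 @ W + u @ B + b)
    h0 = sample_bernoulli(h0_prob)

    # Negative phase: mean-field reconstruction of the Gaussian visibles,
    # then recompute hidden probabilities.
    v1 = h0 @ W.T + u @ A + a
    h1_prob = sigmoid(v1 @ W + u @ B + b)

    # Approximate gradient ascent on the log-likelihood (CD-1).
    n = v0.shape[0]
    W += lr * (v0.T @ h0_prob - v1.T @ h1_prob) / n
    A += lr * (u.T @ (v0 - v1)) / n
    B += lr * (u.T @ (h0_prob - h1_prob)) / n
    a += lr * (v0 - v1).mean(axis=0)
    b += lr * (h0_prob - h1_prob).mean(axis=0)
    return W, A, B, a, b
```

At conversion time, one plausible use of such a model is to clamp the source half of the visible layer and the context, and infer the target half by a few mean-field or Gibbs updates; the exact conversion procedure would follow the paper rather than this sketch.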