Joint spectral distribution modeling using restricted Boltzmann machines for voice conversion

This paper presents a new spectral modeling and conversion method for voice conversion. In contrast to conventional Gaussian mixture model (GMM) based methods, we use restricted Boltzmann machines (RBMs) as probability density models for the joint distributions of source and target spectral features. The Gaussian distribution in each mixture component of the GMM is replaced by an RBM, which better captures the inter-dimensional and inter-speaker correlations within the joint spectral features. Spectral conversion is performed under the maximum conditional output probability criterion. Experimental results show that both the similarity and the naturalness achieved by the proposed method are significantly improved compared with the conventional GMM-based method.

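As a minimal sketch of the underlying idea (not the authors' implementation), the joint distribution of concatenated source and target spectral vectors can be modeled with a Gaussian-Bernoulli RBM trained by one-step contrastive divergence (CD-1). All variable names, dimensionalities, learning rates, and the placeholder data below are illustrative assumptions.

```python
# Minimal sketch: Gaussian-Bernoulli RBM over joint [source; target] spectral
# vectors, trained with one-step contrastive divergence (CD-1).
# Dimensions, learning rate, and data are illustrative assumptions,
# not the paper's actual configuration.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GaussianBernoulliRBM:
    def __init__(self, n_visible, n_hidden, lr=1e-3):
        # Visible units are real-valued (assumed unit variance after
        # normalizing the spectral features); hidden units are binary.
        self.W = 0.01 * rng.standard_normal((n_visible, n_hidden))
        self.b = np.zeros(n_visible)   # visible biases
        self.c = np.zeros(n_hidden)    # hidden biases
        self.lr = lr

    def hidden_probs(self, v):
        return sigmoid(v @ self.W + self.c)

    def visible_mean(self, h):
        # Conditional mean of the Gaussian visible units given hidden states.
        return h @ self.W.T + self.b

    def cd1_update(self, v0):
        # One step of contrastive divergence on a mini-batch v0.
        h0 = self.hidden_probs(v0)
        h0_sample = (rng.random(h0.shape) < h0).astype(v0.dtype)
        v1 = self.visible_mean(h0_sample)   # mean-field reconstruction
        h1 = self.hidden_probs(v1)
        n = v0.shape[0]
        self.W += self.lr * (v0.T @ h0 - v1.T @ h1) / n
        self.b += self.lr * (v0 - v1).mean(axis=0)
        self.c += self.lr * (h0 - h1).mean(axis=0)

# Illustrative usage: z = [x; y] stacks time-aligned source (x) and target (y)
# spectral frames, e.g. 24-dimensional mel-cepstra each.
n_frames, dim = 1000, 24
x = rng.standard_normal((n_frames, dim))   # placeholder source features
y = rng.standard_normal((n_frames, dim))   # placeholder target features
z = np.hstack([x, y])

rbm = GaussianBernoulliRBM(n_visible=2 * dim, n_hidden=128)
for epoch in range(10):
    for start in range(0, n_frames, 100):
        rbm.cd1_update(z[start:start + 100])
```

At conversion time, the target half of the visible vector would be inferred from the source half, for example by iterating the model's conditionals to approximately maximize the conditional output probability referred to in the abstract; the exact inference procedure used by the authors is not reproduced here.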