Local partial least square regression for spectral mapping in voice conversion

Joint density Gaussian mixture model (JD-GMM) based method has been widely used in voice conversion task due to its flexible implementation. However, the statistical averaging effect during estimating the model parameters will result in over-smoothing the target spectral trajectories. Motivated by the local linear transformation method, which uses neighboring data rather than all the training data to estimate the transformation function for each feature vector, we proposed a local partial least square method to avoid the over-smoothing problem of JD-GMM and the over-fitting problem of local linear transformation when training data are limited. We conducted experiments using the VOICES database and measure both spectral distortion and correlation coefficient of the spectral parameter trajectory. The experimental results show that our proposed method obtain better performance as compared to baseline methods.

[1]  Olivier Boëffard,et al.  Pitch and Duration Transformation with Non-parallel Data , 2008 .

[2]  Elina Helander,et al.  A Novel Method for Prosody Prediction in Voice Conversion , 2007, 2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07.

[3]  Haizhou Li,et al.  Conditional restricted Boltzmann machine for voice conversion , 2013, 2013 IEEE China Summit and International Conference on Signal and Information Processing.

[4]  Alexander Kain,et al.  High-resolution voice transformation , 2001 .

[5]  S. D. Jong SIMPLS: an alternative approach to partial least squares regression , 1993 .

[6]  Moncef Gabbouj,et al.  Local linear transformation for voice conversion , 2012, 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[7]  Haizhou Li,et al.  Text-independent F0 transformation with non-parallel data for voice conversion , 2010, INTERSPEECH.

[8]  Haizhou Li,et al.  Mixture of Factor Analyzers Using Priors From Non-Parallel Speech for Voice Conversion , 2012, IEEE Signal Processing Letters.

[9]  Chung-Hsien Wu,et al.  Voice conversion using duration-embedded bi-HMMs for expressive speech synthesis , 2006, IEEE Transactions on Audio, Speech, and Language Processing.

[10]  Kishore Prahallad,et al.  Voice conversion using Artificial Neural Networks , 2009, 2009 IEEE International Conference on Acoustics, Speech and Signal Processing.

[11]  Roman Rosipal,et al.  Overview and Recent Advances in Partial Least Squares , 2005, SLSFS.

[12]  Hideki Kawahara,et al.  Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds , 1999, Speech Commun..

[13]  Tomoki Toda,et al.  Voice Conversion Based on Maximum-Likelihood Estimation of Spectral Parameter Trajectory , 2007, IEEE Transactions on Audio, Speech, and Language Processing.

[14]  Eric Moulines,et al.  Continuous probabilistic transform for voice conversion , 1998, IEEE Trans. Speech Audio Process..

[15]  Alexander Kain,et al.  Spectral voice conversion for text-to-speech synthesis , 1998, Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP '98 (Cat. No.98CH36181).

[16]  Moncef Gabbouj,et al.  Voice Conversion Using Dynamic Kernel Partial Least Squares Regression , 2012, IEEE Transactions on Audio, Speech, and Language Processing.

[17]  Jia Liu,et al.  Voice conversion with smoothed GMM and MAP adaptation , 2003, INTERSPEECH.

[18]  Moncef Gabbouj,et al.  Voice Conversion Using Partial Least Squares Regression , 2010, IEEE Transactions on Audio, Speech, and Language Processing.