Frame correlation based autoregressive GMM method for voice conversion

In this paper, we present a frame correlation based autoregressive GMM method for voice conversion. In our system, the cross-frame correlation of the source features is modeled with augmented delta features, while the cross-frame correlation of the target features is modeled with autoregressive models. The expectation-maximization (EM) algorithm is used for model training, and a maximum-likelihood parameter conversion algorithm is then employed to convert the features of a source speaker into those of a target speaker frame by frame. The method is consistent between training and conversion, since the cross-frame correlation of the target features is used explicitly at both stages. The experimental results show that the proposed method achieves excellent performance: its test-set log probability is higher than that of the GMM-DYN (GMM with dynamic features) method, and its subjective evaluation results are comparable to those of GMM-DYN. Furthermore, it is much better suited to low-latency applications.
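
To make the frame-by-frame conversion idea concrete, the following is a minimal sketch, not the paper's actual algorithm: it assumes a hypothetical autoregressive GMM whose per-mixture parameters (weights, source-side Gaussians, and linear predictors A, B, b such that y_t = A x_t + B y_{t-1} + b + noise) are given, and it converts one delta-augmented source frame at a time using the posterior-weighted conditional mean of the target frame. All names and the weighting scheme are illustrative assumptions; the full maximum-likelihood parameter conversion described above may differ in detail.

import numpy as np

def convert_frame(x_t, y_prev, weights, means_x, covs_x, A, B, b):
    """Convert one source frame x_t given the previously converted target frame y_prev.

    Hypothetical parameters (one entry per mixture component m):
      weights[m]            mixture weight
      means_x[m], covs_x[m] Gaussian on the delta-augmented source frame
      A[m], B[m], b[m]      autoregressive predictor of the target frame
    """
    n_mix = len(weights)

    # Posterior probability of each mixture component given the source frame.
    log_post = np.empty(n_mix)
    for m in range(n_mix):
        diff = x_t - means_x[m]
        _, logdet = np.linalg.slogdet(covs_x[m])
        log_post[m] = (np.log(weights[m])
                       - 0.5 * (logdet + diff @ np.linalg.solve(covs_x[m], diff)))
    log_post -= log_post.max()          # numerical stabilization
    post = np.exp(log_post)
    post /= post.sum()

    # Posterior-weighted conditional mean of the target frame,
    # using the previously converted frame through the AR term.
    y_t = np.zeros_like(y_prev)
    for m in range(n_mix):
        y_t += post[m] * (A[m] @ x_t + B[m] @ y_prev + b[m])
    return y_t

Because each frame depends only on the current source frame and the previously converted target frame, such a scheme can run with essentially one frame of latency, which is the property the abstract highlights for low-latency applications.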
