Spectral conversion based on statistical models including time-sequence matching

This paper proposes a spectral conversion technique based on a new statisticalmodel which includes time-sequence matching. In conventional GMM-based approaches, the Dynamic Programming (DP) matching between source and target feature sequences is performed prior to the training of GMMs. Although a similarity measure of two frames, e.g., the Euclid distance is typically adopted, this might be inappropriate for converting the spectral features. The likelihood function of the proposed model can directly deal with two different length sequences, in which a frame alignment of source and target feature sequences is represented by discrete hidden variables. In the proposed algorithm, the maximum likelihood criterion is consistently applied to the training of model parameters, sequence matching and spectral conversion. In the subjective preference test, the proposed method is superior than the conventionalGMM-based method.

[1]  Naonori Ueda,et al.  Deterministic annealing EM algorithm , 1998, Neural Networks.

[2]  Michael I. Jordan,et al.  An Introduction to Variational Methods for Graphical Models , 1999, Machine-mediated learning.

[3]  Jia Liu,et al.  Voice conversion with smoothed GMM and MAP adaptation , 2003, INTERSPEECH.

[4]  Keiichi Tokuda,et al.  Spectral conversion based on maximum likelihood estimation considering global variance of converted parameter , 2005, Proceedings. (ICASSP '05). IEEE International Conference on Acoustics, Speech, and Signal Processing, 2005..

[5]  Michael I. Jordan,et al.  An Introduction to Variational Methods for Graphical Models , 1999, Machine Learning.

[6]  Zoubin Ghahramani Gatsby On Structured Variational Approximations , 1997 .

[7]  Keiichi Tokuda,et al.  Speech parameter generation algorithms for HMM-based speech synthesis , 2000, 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.00CH37100).