Mapping from articulatory movements to vocal tract spectrum with Gaussian mixture model for articulatory speech synthesis

This paper describes a method for determining the vocal tract spectrum from articulatory movements using a Gaussian mixture model (GMM), with the goal of synthesizing speech from articulatory information. A GMM of the joint probability density of articulatory parameters and acoustic spectral parameters is trained on a parallel acoustic-articulatory speech database. We evaluate the performance of the GMM-based mapping with a spectral distortion measure. Experimental results demonstrate that the distortion can be reduced by using not only the articulatory parameters of the vocal tract but also power and voicing information as input features. Moreover, to determine the best mapping, we apply maximum likelihood estimation (MLE) to the GMM-based mapping method. Experimental results show that MLE using both static and dynamic features improves the mapping accuracy compared with the conventional GMM-based mapping.
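The core of the conventional GMM-based mapping described above is standard: fit a GMM to the joint articulatory-acoustic feature vectors, then estimate the spectral parameters for a new articulatory frame as the conditional expectation (minimum mean-square-error estimate) under that joint model. The following sketch illustrates this under stated assumptions: the function names, the use of scikit-learn's `GaussianMixture` for training, and the synthetic data are illustrative and not taken from the paper.

```python
# Illustrative sketch of GMM-based articulatory-to-acoustic mapping
# (conditional-expectation form); not the authors' implementation.
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture


def fit_joint_gmm(X, Y, n_components=4, seed=0):
    """Fit a GMM on the joint [articulatory; spectral] feature space."""
    Z = np.hstack([X, Y])
    gmm = GaussianMixture(n_components=n_components,
                          covariance_type="full", random_state=seed)
    gmm.fit(Z)
    return gmm


def gmm_map(gmm, X, dx):
    """Map articulatory frames X (N, dx) to spectral frames via the
    MMSE estimator  y_hat = sum_m P(m|x) * E[y | x, m]."""
    means = gmm.means_                      # (M, dx + dy)
    covs = gmm.covariances_                 # (M, dx + dy, dx + dy)
    mu_x, mu_y = means[:, :dx], means[:, dx:]
    Sxx = covs[:, :dx, :dx]                 # articulatory block
    Syx = covs[:, dx:, :dx]                 # cross-covariance block

    # Component posteriors P(m | x) from the marginal GMM over x
    M = gmm.n_components
    logp = np.stack([multivariate_normal.logpdf(X, mu_x[m], Sxx[m])
                     for m in range(M)], axis=1) + np.log(gmm.weights_)
    logp -= logp.max(axis=1, keepdims=True)  # numerical stability
    post = np.exp(logp)
    post /= post.sum(axis=1, keepdims=True)

    # Posterior-weighted sum of per-component conditional means
    Y_hat = np.zeros((X.shape[0], mu_y.shape[1]))
    for m in range(M):
        A = Syx[m] @ np.linalg.inv(Sxx[m])
        cond = mu_y[m] + (X - mu_x[m]) @ A.T  # E[y | x, m]
        Y_hat += post[:, [m]] * cond
    return Y_hat
```

The MLE refinement the abstract reports goes further: instead of frame-by-frame conditional means, it maximizes the likelihood of the whole spectral trajectory given static and dynamic (delta) features, which smooths the output across frames; that trajectory-level solve is not shown here.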
