VTLN-Based Rapid Cross-Lingual Adaptation for Statistical Parametric Speech Synthesis

Cross-lingual speaker adaptation (CLSA) has emerged as a new challenge in statistical parametric speech synthesis, with specific application to speech-to-speech translation. Recent research has shown that reasonable speaker similarity can be achieved in CLSA using maximum likelihood linear transformation of model parameters, but this method suffers from the inherent mismatch caused by the differing phonetic inventories of the languages involved. In this paper, we propose that fast and effective CLSA can be performed using vocal tract length normalization (VTLN), where the strong constraints of the vocal tract warping function may actually help to avoid the most severe effects of the aforementioned mismatch. VTLN warps the spectrum with a single parameter; combined with shifted or adapted pitch, it can still achieve reasonable speaker similarity. We present our approach, VTLN-based CLSA, and evaluation results that support our proposal, under the limitation that the voice identity and speaking style of the target speaker do not diverge too far from those of the average voice model.
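For concreteness, VTLN's single-parameter warp is commonly realized as a first-order all-pass (bilinear) transform of the frequency axis; the formulation below is a standard textbook sketch for illustration, not a formula quoted from the paper. With warping factor \alpha, |\alpha| < 1, frequency \omega is mapped to

\tilde{\omega} = \omega + 2 \arctan\!\left( \frac{\alpha \sin \omega}{1 - \alpha \cos \omega} \right)

so \alpha = 0 leaves the spectrum unchanged and the sign of \alpha controls the direction of the warp. A well-known property of this class of warps is that they act as a linear transformation on cepstral coefficients, which is what allows the single parameter \alpha to be estimated by maximum likelihood within the same adaptation framework as conventional linear-transform methods.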
