Transform mapping using shared decision tree context clustering for HMM-based cross-lingual speech synthesis

This paper proposes a novel transform mapping technique based on shared decision tree context clustering (STC) for HMM-based cross-lingual speech synthesis. In conventional cross-lingual speaker adaptation based on state mapping, adaptation performance degrades when there are language and speaker mismatches between the average voice models of the input and output languages. The proposed technique alleviates the effect of these mismatches on the transform mapping by introducing a language-independent decision tree constructed by STC, representing the average voice models with both language-independent and language-dependent tree structures. A bilingual speech corpus is also used to keep speaker characteristics consistent between the average voice models of the two languages. Experimental results show that the proposed technique reduces both spectral and prosodic distortions between original and generated parameter trajectories and significantly improves the naturalness of synthetic speech while maintaining speaker similarity, compared with conventional state mapping.
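The conventional state mapping baseline mentioned above is commonly built by pairing each HMM state of the output-language average voice model with the closest state of the input-language model under a Kullback-Leibler divergence criterion. The sketch below illustrates that idea under simplifying assumptions (not the paper's implementation): single diagonal-covariance Gaussian state output distributions and a symmetric KLD distance; the function names `kl_diag_gauss` and `map_states` are illustrative, not from any cited toolkit.

```python
# Illustrative sketch of KLD-based state mapping (assumes single
# diagonal-covariance Gaussian output distributions per HMM state).
import numpy as np

def kl_diag_gauss(mu1, var1, mu2, var2):
    """KL( N(mu1, diag(var1)) || N(mu2, diag(var2)) ), closed form for
    diagonal-covariance Gaussians."""
    return 0.5 * np.sum(
        np.log(var2 / var1) + (var1 + (mu1 - mu2) ** 2) / var2 - 1.0
    )

def map_states(states_out, states_in):
    """For each output-language state (mu, var), return the index of the
    input-language state with minimal symmetric KL divergence."""
    mapping = []
    for mu_o, var_o in states_out:
        dists = [
            kl_diag_gauss(mu_o, var_o, mu_i, var_i)
            + kl_diag_gauss(mu_i, var_i, mu_o, var_o)
            for mu_i, var_i in states_in
        ]
        mapping.append(int(np.argmin(dists)))
    return mapping
```

During adaptation, linear transforms estimated on the input-language states are then applied to the mapped output-language states; the paper's contribution is to constrain this mapping through a language-independent tree built by STC rather than relying on raw KLD proximity alone.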
