A cross-language state mapping approach to bilingual (Mandarin-English) TTS

We propose a cross-language state mapping approach to HMM-based bilingual TTS. Two language-dependent decision trees are first built from a bilingual speech database recorded by a single speaker. For every leaf node in the decision tree of the target language, a state mapping is created by finding the nearest leaf node in the tree of the source language, where nearness is measured by the Kullback-Leibler divergence between the two leaf distributions. To synthesize target-language speech in the voice of a monolingual (source-language) speaker, we use the HMM parameters trained on that speaker's data in the mapped leaf nodes. Similar mappings can be constructed by reversing the source and target languages. With these bi-directional cross-lingual mappings, we can synthesize bilingual or mixed-code speech with HMMs trained on any monolingual speaker. High voice (speaker) similarity is preserved in the synthesized speech of the target language. Two perceptual tests on synthesized Mandarin speech confirm high intelligibility, with a Chinese character transcription accuracy of 92.1% and a mean opinion score (MOS) of 3.08.
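
The nearest-leaf search described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each leaf node holds a single diagonal-covariance Gaussian, uses the one-directional divergence KL(target || source), and the names `kl_diag_gauss` and `map_leaves` as well as the toy data are hypothetical.

```python
import numpy as np

def kl_diag_gauss(mu_p, var_p, mu_q, var_q):
    """KL(p || q) between two diagonal-covariance Gaussians (assumed leaf model)."""
    mu_p, var_p, mu_q, var_q = (np.asarray(a, dtype=float)
                                for a in (mu_p, var_p, mu_q, var_q))
    return 0.5 * np.sum(
        np.log(var_q / var_p) + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0
    )

def map_leaves(target_leaves, source_leaves):
    """For each target-language leaf, return the index of the source-language
    leaf with the smallest KL divergence from it."""
    mapping = []
    for mu_t, var_t in target_leaves:
        divergences = [kl_diag_gauss(mu_t, var_t, mu_s, var_s)
                       for mu_s, var_s in source_leaves]
        mapping.append(int(np.argmin(divergences)))
    return mapping

# Toy leaves as (mean, variance) pairs; the values are made up for illustration.
target_leaves = [
    (np.array([0.0, 0.0]), np.array([1.0, 1.0])),
    (np.array([5.0, 5.0]), np.array([1.0, 1.0])),
]
source_leaves = [
    (np.array([4.8, 5.1]), np.array([1.0, 1.0])),
    (np.array([0.2, -0.1]), np.array([1.0, 1.0])),
]

mapping = map_leaves(target_leaves, source_leaves)
print(mapping)
```

Under these assumptions, synthesis in the target language would look up, for each target leaf `i`, the parameters that the monolingual speaker's model stores at source leaf `mapping[i]`. Swapping the two arguments of `map_leaves` gives the mapping in the reverse direction.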
