Analysis on acoustic similarities between Tamil and English phonemes using product of likelihood-Gaussians for an HMM-based mixed-language synthesizer

A mixed-language (polyglot) synthesizer is one that synthesizes intelligible multilingual speech in a single speaker's voice with appropriate pronunciations. Two main requirements of a mixed-language synthesizer are that neither (i) the transition from one language to another (language switching) nor (ii) the influence of one language on another should be perceivable. In this regard, in [1], while developing a bilingual text-to-speech (TTS) system for Mandarin and English, the minimum Kullback-Leibler divergence (KLD) criterion, applied state-wise to context-independent hidden Markov models (HMMs), is used to cluster the states of acoustically similar phonemes across the two languages. In the current work, using context-independent HMMs trained separately for two languages, namely Tamil and English, an attempt is made to find the acoustically similar phonemes using a product of Gaussians (PoG) in the log-likelihood space. A speech corpus with Tamil and English data, uttered by the same speaker, is used for this task. The quality of the speech synthesized by the mixed-language synthesizer is assessed subjectively, and a mean opinion score of 3.49 is obtained when only acoustically similar phonemes are merged. In addition, analyses are carried out to find the amount of language switching and the influence of one language on the other.
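For readers unfamiliar with the two quantities named above, the following Python sketch shows the standard closed forms for the KLD between two diagonal-covariance Gaussians and for the normalized product of two univariate Gaussians. It is an illustration only, not the procedure of this work or of [1]: the function names are hypothetical, the inputs stand in for the Gaussian parameters of single HMM states, and how the PoG is applied in the log-likelihood space to decide phoneme similarity is not described here and is not reproduced.

```python
import numpy as np


def kld_diag_gaussians(mu_p, var_p, mu_q, var_q):
    """KL(p || q) for two Gaussians with diagonal covariance (closed form)."""
    mu_p, var_p, mu_q, var_q = (
        np.asarray(x, dtype=float) for x in (mu_p, var_p, mu_q, var_q)
    )
    return 0.5 * np.sum(
        np.log(var_q / var_p) + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0
    )


def symmetric_kld(mu_p, var_p, mu_q, var_q):
    """Symmetrized divergence: KL(p || q) + KL(q || p)."""
    return kld_diag_gaussians(mu_p, var_p, mu_q, var_q) + kld_diag_gaussians(
        mu_q, var_q, mu_p, var_p
    )


def product_of_gaussians(mu1, var1, mu2, var2):
    """Mean and variance of the normalized product of two univariate Gaussians."""
    var = 1.0 / (1.0 / var1 + 1.0 / var2)   # precisions add
    mu = var * (mu1 / var1 + mu2 / var2)    # precision-weighted mean
    return mu, var


# Hypothetical state parameters, chosen only to exercise the functions.
d_same = symmetric_kld([0.0, 1.0], [1.0, 2.0], [0.0, 1.0], [1.0, 2.0])
d_diff = kld_diag_gaussians([0.0], [1.0], [1.0], [1.0])
pog_mu, pog_var = product_of_gaussians(0.0, 1.0, 2.0, 1.0)
```

A smaller divergence indicates states whose output distributions are closer. The product of two Gaussians is sharper than either factor, since its variance is smaller than both input variances.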

[1]  Frank K. Soong, et al. A Cross-Language State Sharing and Mapping Approach to Bilingual (Mandarin–English) TTS, 2009, IEEE Transactions on Audio, Speech, and Language Processing.

[2]  Sadaoki Furui, et al. Polyglot synthesis using a mixture of monolingual corpora, 2005, Proceedings. (ICASSP '05). IEEE International Conference on Acoustics, Speech, and Signal Processing, 2005.

[3]  Hema A Murthy, et al. Development and evaluation of unit selection and HMM-based speech synthesis systems for Tamil, 2013, 2013 National Conference on Communications (NCC).

[4]  Douglas D. O'Shaughnessy, et al. Bias Estimation and Correction in a Classifier using Product of Likelihood-Gaussians, 2007, 2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07.

[5]  Beat Pfister, et al. From multilingual to polyglot speech synthesis, 1999, EUROSPEECH.