DNN-based Bilingual (Telugu-Hindi) Polyglot Speech Synthesis

Bilingual polyglot speech synthesis refers to the process of synthesizing two languages in the same voice, using a single speech synthesizer. In India, people often mix Hindi with their mother tongue in everyday conversations. Hence, text-to-speech (TTS) systems that can handle bilingual text are required. In this work, a deep neural network (DNN)-based bilingual speech synthesis system is developed for Telugu and Hindi, using a polyglot speech corpus collected from a bilingual female speaker. The performance of the developed DNN-based bilingual synthesizer is evaluated and compared with an HMM-based one by means of a preference test and a mean opinion score (MOS) test. The results show that the quality of the DNN-based synthesizer is considerably better than that of the HMM-based synthesizer, achieving an average MOS of 3.59.
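In DNN-based statistical parametric synthesis, a feed-forward network maps frame-level linguistic features (phone identity, positional and contextual features) to acoustic features such as mel-cepstral coefficients and log-F0, replacing the decision-tree-clustered HMM states. The sketch below illustrates this mapping with a minimal forward pass; the layer sizes, feature dimensions, and activation choices are illustrative assumptions, not the configuration used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed dimensions for illustration only.
LING_DIM = 100   # linguistic input features per frame
HID_DIM = 256    # hidden layer width
ACOUS_DIM = 43   # e.g. 40 mel-cepstral coeffs + log-F0 + voicing + aperiodicity

def init_layer(n_in, n_out):
    """Small random weights and zero biases for one dense layer."""
    return rng.normal(0.0, 0.01, (n_in, n_out)), np.zeros(n_out)

layers = [init_layer(LING_DIM, HID_DIM),
          init_layer(HID_DIM, HID_DIM),
          init_layer(HID_DIM, ACOUS_DIM)]

def forward(x):
    """Map a batch of linguistic feature frames to acoustic feature frames."""
    h = x
    for i, (W, b) in enumerate(layers):
        h = h @ W + b
        if i < len(layers) - 1:   # tanh hidden layers, linear output layer
            h = np.tanh(h)
    return h

# One utterance of 200 frames of (dummy) linguistic features.
ling = rng.normal(size=(200, LING_DIM))
acoustic = forward(ling)
print(acoustic.shape)  # (200, 43)
```

In a full system, the predicted acoustic trajectories would be passed to a vocoder to generate the waveform; for a bilingual polyglot voice, the linguistic feature set must cover the phone inventories of both Telugu and Hindi so a single network serves both languages.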
