Influence of various asymmetrical contextual factors for TTS in a low resource language

The generalized statistical framework of the Hidden Markov Model (HMM) has been carried over successfully from speech recognition to speech synthesis. In this paper, we apply the HMM-based Speech Synthesis (HTS) method to Gujarati, one of the official languages of India, and adapt and evaluate HTS for that language. In addition, to understand the influence of asymmetrical contextual factors on the quality of synthesized speech, we conduct a series of experiments. The HTS systems built for Gujarati speech with various asymmetrical contextual factors are evaluated in terms of naturalness and speech intelligibility. The experimental results show that when more weight is given to the left phoneme context in an asymmetrical contextual factor, HTS performance improves over conventional symmetrical contextual factors in both the triphone and pentaphone cases. Furthermore, we achieved the best performance for Gujarati HTS with left-left-left-centre-right (i.e., LLLCR) contextual factors.
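To make the notion of an asymmetrical contextual factor concrete, the sketch below builds a phoneme-context string for each phone in an utterance from a chosen number of left and right neighbours. It is only an illustration under stated assumptions: the function name `context_labels`, the padding symbol `x`, and the `l^l-c+r` separator layout are hypothetical and are not the label format used in the paper or by the HTS toolkit.

```python
from typing import List

PAD = "x"  # hypothetical padding symbol for positions outside the utterance


def context_labels(phones: List[str], n_left: int, n_right: int) -> List[str]:
    """Return one context string per phone.

    n_left and n_right are the numbers of preceding and following phones
    included in the context. (1, 1) gives a symmetric triphone (LCR),
    (2, 2) a symmetric pentaphone (LLCRR), and (3, 1) the asymmetric
    LLLCR window. The separator layout is an assumption for illustration.
    """
    padded = [PAD] * n_left + list(phones) + [PAD] * n_right
    labels = []
    for i, centre in enumerate(phones):
        c = i + n_left                      # index of the centre phone in `padded`
        left = padded[c - n_left:c]         # n_left preceding phones
        right = padded[c + 1:c + 1 + n_right]  # n_right following phones
        labels.append("^".join(left) + "-" + centre + "+" + "^".join(right))
    return labels


if __name__ == "__main__":
    phones = ["k", "a", "m", "a", "l"]      # toy phone sequence, not real data
    print(context_labels(phones, 2, 2))     # symmetric pentaphone (LLCRR)
    print(context_labels(phones, 3, 1))     # asymmetric LLLCR
```

Both windows span five phones, but the LLLCR window trades one right neighbour for an extra left neighbour, which is the kind of asymmetry the experiments compare against the symmetric case.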
