F0 contour generation and synthesis using Bengali HMM-based speech synthesis system

The HMM-based Bengali speech synthesis system (Bengali-HTS) generates highly intelligible synthesized speech, but its naturalness is inadequate even though it is trained on a large speech corpus. For interrogative, imperative and exclamatory sentences, the naturalness of the synthesized speech falls drastically. This paper proposes a method to overcome this problem by modifying the F0 contour of the synthetic speech based on the Fujisaki model. Fujisaki model features for different types of Bengali sentences are analyzed for F0 contour generation. Because these features depend on the prosodic word/phrase boundaries of the sentence, a two-layer supervised classification and regression tree is trained to predict the prosodic word/phrase boundaries. The Fujisaki model then generates the F0 contour from the input text using the predicted prosodic word/phrase boundaries and the segmental duration information from the HMM-based speech synthesis system. Moreover, for HMM training, the prosodic structure of the sentence is employed rather than the lexical structure. MOS and preference tests show that the proposed method significantly improves the overall quality of the synthesized speech compared with Bengali-HTS.
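The Fujisaki model mentioned above generates a log-F0 contour as a baseline value plus the superposition of phrase components (impulse responses to phrase commands) and accent components (step responses to accent commands). The sketch below is a minimal generic implementation of that superposition, not the authors' code; the default time constants (alpha, beta, gamma) and the example command timings are illustrative assumptions, since the paper's Bengali-specific parameter values are not given here.

```python
import numpy as np

def phrase_component(t, alpha=3.0):
    """Impulse response Gp(t) = alpha^2 * t * exp(-alpha*t) for t >= 0, else 0."""
    tp = np.maximum(t, 0.0)
    return np.where(t >= 0.0, alpha**2 * tp * np.exp(-alpha * tp), 0.0)

def accent_component(t, beta=20.0, gamma=0.9):
    """Step response Ga(t) = min(1 - (1 + beta*t) * exp(-beta*t), gamma) for t >= 0."""
    tp = np.maximum(t, 0.0)
    g = 1.0 - (1.0 + beta * tp) * np.exp(-beta * tp)
    return np.where(t >= 0.0, np.minimum(g, gamma), 0.0)

def fujisaki_f0(t, fb, phrase_cmds, accent_cmds, alpha=3.0, beta=20.0, gamma=0.9):
    """Fujisaki superposition:
    ln F0(t) = ln Fb + sum_i Ap_i * Gp(t - T0_i)
                     + sum_j Aa_j * (Ga(t - T1_j) - Ga(t - T2_j))
    phrase_cmds: list of (Ap, T0); accent_cmds: list of (Aa, T1, T2)."""
    ln_f0 = np.full_like(t, np.log(fb))
    for ap, t0 in phrase_cmds:
        ln_f0 += ap * phrase_component(t - t0, alpha)
    for aa, t1, t2 in accent_cmds:
        ln_f0 += aa * (accent_component(t - t1, beta, gamma)
                       - accent_component(t - t2, beta, gamma))
    return np.exp(ln_f0)

# Illustrative usage: one phrase command and one accent command over a 2 s utterance.
t = np.linspace(0.0, 2.0, 201)
f0 = fujisaki_f0(t, fb=120.0, phrase_cmds=[(0.5, 0.1)], accent_cmds=[(0.3, 0.4, 0.8)])
```

In the proposed system, the command amplitudes and timings would be derived from the predicted prosodic word/phrase boundaries and the HMM segmental durations; here they are hard-coded only to show the contour computation.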
