Duration Modeling by Multi-Models Based on Vowel Production Characteristics

Accurate estimation of segmental durations is needed for natural-sounding text-to-speech (TTS) synthesis. This paper proposes multi-models based on the production aspects of vowels. Four multi-models are developed, based on vowel length, vowel height, vowel frontness and vowel roundness. In each multi-model, syllables are divided into groups according to a specific vowel articulation characteristic. Two kinds of constraints are used for predicting duration patterns: (i) linguistic constraints, represented by positional, contextual and phonological features, and (ii) production constraints, represented by articulatory features. Feed-forward neural networks are used to develop the duration models. Compared with a single duration model, the multi-model based on vowel length production characteristics reduces the average prediction error by 23.21% and improves the correlation coefficient by 9.64%.
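The grouping idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: the vowel inventory `LONG_VOWELS`, the syllable fields `vowel` and `duration_ms`, and the class names are hypothetical, and a trivial mean-duration predictor stands in for the feed-forward neural network so that the example stays self-contained. The error measure shown (mean absolute percentage deviation) is also an assumption, since the abstract does not define "average prediction error".

```python
from collections import defaultdict
from statistics import mean

# Hypothetical set of long-vowel labels; the paper's actual inventory is not given here.
LONG_VOWELS = {"aa", "ii", "uu", "ee", "oo"}


def vowel_length_group(syllable):
    """Assign a syllable to the 'long' or 'short' group by its vowel label."""
    return "long" if syllable["vowel"] in LONG_VOWELS else "short"


class MeanDurationModel:
    """Stand-in for a feed-forward network: always predicts the training mean."""

    def fit(self, syllables):
        self.mean_ms = mean(s["duration_ms"] for s in syllables)
        return self

    def predict(self, syllable):
        return self.mean_ms


class MultiModel:
    """Trains one duration model per group and routes each syllable to its group's model."""

    def __init__(self, group_fn, model_factory):
        self.group_fn = group_fn
        self.model_factory = model_factory
        self.models = {}

    def fit(self, syllables):
        groups = defaultdict(list)
        for s in syllables:
            groups[self.group_fn(s)].append(s)
        self.models = {
            name: self.model_factory().fit(members)
            for name, members in groups.items()
        }
        return self

    def predict(self, syllable):
        return self.models[self.group_fn(syllable)].predict(syllable)


def average_prediction_error(actual, predicted):
    """Mean absolute deviation as a percentage of the actual duration (assumed metric)."""
    return 100.0 * mean(abs(a - p) / a for a, p in zip(actual, predicted))


if __name__ == "__main__":
    # Toy data, invented purely for illustration.
    data = [
        {"vowel": "a", "duration_ms": 100},
        {"vowel": "a", "duration_ms": 120},
        {"vowel": "aa", "duration_ms": 200},
        {"vowel": "aa", "duration_ms": 220},
    ]
    actual = [s["duration_ms"] for s in data]
    single = MeanDurationModel().fit(data)
    multi = MultiModel(vowel_length_group, MeanDurationModel).fit(data)
    print("single:", average_prediction_error(actual, [single.predict(s) for s in data]))
    print("multi: ", average_prediction_error(actual, [multi.predict(s) for s in data]))
```

Swapping `vowel_length_group` for a function that groups by height, frontness or roundness would give the other three multi-models under the same assumptions; replacing `MeanDurationModel` with a trained network would leave the routing logic unchanged.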
