Decision tree micro-prosody structures for text to speech synthesis

This paper explores the use of micro-prosody to improve the quality of synthesised speech in concatenative text-to-speech (TTS) synthesis systems. Micro-prosody is defined as the prosodic signals within context-dependent triphone units and across neighbouring triphones. Micro-prosody parameters are modelled using a Markovian model whose state distributions depend on the current linguistic-prosodic state as well as on the current and neighbouring phones. The paper explores various speech unit selection criteria for the design of the TTS sound inventory, and their effects on reducing the variance of micro-prosodic parameters in concatenated speech and on the TTS output speech. It also considers the variability of the prosodic parameters in the recorded samples from a given speaker, and the influence of accents, such as US- and UK-accented English, on speech prosody variability and on the design of TTS systems.
