Hierarchical Stress Modeling in Mandarin Text-to-Speech

Automatic stress prediction is helpful for both speech synthesis and natural speech understanding. This paper proposes a novel hierarchical method for modeling Mandarin stress. The top level focuses on stressed syllables, while the bottom level, for the first time, focuses on unstressed syllables because of their importance to both the naturalness and the expressiveness of synthetic speech. A Maximum Entropy model is adopted to predict the stress structure from textual features. Experiments show that the method successfully captures both the macro- and micro-characteristics of stress. The F-scores of the two levels of stress prediction are 73.3% and 78.7%, respectively, which is satisfactory relative to other prosody prediction tasks.

Index Terms: Text-to-Speech, prosody, stress, Mandarin
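To make the modeling step concrete, the following is a minimal sketch of how a binary Maximum Entropy classifier (equivalent to logistic regression) could predict a stress label from textual features, and how an F-score is computed. It is not the paper's implementation. The feature names (`pos=noun`, `word_final`, and so on), the toy data, the function names, and the hyperparameters are all assumptions made for illustration. It also shows only a single level, not the two-level hierarchy.

```python
import math
from collections import defaultdict


def predict_proba(weights, feats):
    """Probability that a syllable with these binary features is stressed."""
    score = sum(weights.get(f, 0.0) for f in feats)
    return 1.0 / (1.0 + math.exp(-score))


def train_maxent(samples, epochs=200, lr=0.5, l2=0.01):
    """Fit a binary maximum-entropy model by batch gradient ascent.

    samples: list of (features, label) pairs, where features is a set of
    strings and label is 1 (stressed) or 0 (not stressed).
    """
    weights = defaultdict(float)
    for _ in range(epochs):
        grad = defaultdict(float)
        for feats, label in samples:
            p = predict_proba(weights, feats)
            for f in feats:
                # Observed minus expected count of each active feature.
                grad[f] += label - p
        for f, g in grad.items():
            # Averaged gradient step with L2 regularization.
            weights[f] += lr * (g / len(samples) - l2 * weights[f])
    return weights


def f_score(gold, pred):
    """Balanced F-score (F1) for the positive class (label 1)."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy data: each syllable is described by a set of textual features.
samples = [
    ({"pos=noun", "word_final"}, 1),
    ({"pos=verb", "word_final"}, 1),
    ({"pos=noun", "word_initial"}, 0),
    ({"pos=verb", "word_initial"}, 0),
    ({"pos=particle", "tone=neutral"}, 0),
]

weights = train_maxent(samples)
gold = [label for _, label in samples]
pred = [int(predict_proba(weights, feats) >= 0.5) for feats, _ in samples]
print("F-score on the toy training set:", f_score(gold, pred))
```

In a hierarchical setup such as the one described above, one such classifier could be trained per level, each with its own label definition. How the two levels are actually coupled is not specified in this abstract.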
