Paper information - Synthesizing sports commentaries: One or several emphatic stresses?

Emphatic stresses are known to fulfill essential functions in expressive speech. Their integration in speech synthesis usually relies on a prosodic annotation of the training corpus. Emphasized syllables are then assigned a single label or can receive several labels according to their acoustic realization. While it is more complex to predict those various labels for a new text to synthesize, it might allow for a better rendering of the stress in the synthesized speech. This paper examines whether the use of more than one emphatic label improves the perceived expressivity of the synthesized speech. It relies on a manually-annotated expressive corpus of sports commentaries. Statistical acoustic analyses show that four distinct realizations of emphatic stresses can be distinguished. However, perceptual tests indicate that the integration of this distinction in HMM-based speech synthesis does not lead to a significant improvement in expressivity. This seems to imply that the different acoustic realizations of the stress are not required to be explicitly annotated in the training corpus.
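The two annotation schemes compared in the abstract can be illustrated with a minimal sketch. It assumes a hypothetical label inventory: the abstract reports four acoustic realizations but does not name them, so `E1` to `E4`, the single label `E`, the `Syllable` type, and the `collapse` helper are all placeholders introduced here, not the paper's actual labels or tooling. The sketch only shows how a multi-label annotation could be reduced to a single-label one before training.

```python
from dataclasses import dataclass

# Hypothetical label names (assumption): the abstract mentions four distinct
# realizations of emphatic stress but does not say how they are labelled.
MULTI_LABELS = ("E1", "E2", "E3", "E4")
SINGLE_LABEL = "E"   # one label for every emphasized syllable
NO_STRESS = "0"      # syllable without emphatic stress


@dataclass(frozen=True)
class Syllable:
    text: str
    stress: str  # NO_STRESS or one of MULTI_LABELS


def collapse(syllables):
    """Reduce a multi-label annotation to the single-label scheme."""
    collapsed = []
    for syl in syllables:
        if syl.stress in MULTI_LABELS:
            collapsed.append(Syllable(syl.text, SINGLE_LABEL))
        elif syl.stress == NO_STRESS:
            collapsed.append(syl)
        else:
            raise ValueError(f"unknown stress label: {syl.stress!r}")
    return collapsed


def label_inventory(syllables):
    """Return the sorted set of stress labels used in an annotation."""
    return sorted({syl.stress for syl in syllables})


# Toy annotated utterance (invented for illustration only).
multi = [
    Syllable("quel", "E2"),
    Syllable("but", "E4"),
    Syllable("de", NO_STRESS),
    Syllable("la", NO_STRESS),
    Syllable("tête", "E1"),
]
single = collapse(multi)
```

With the multi-label scheme, a synthesizer has to predict which of the four labels applies to each emphasized syllable of a new text; with the collapsed scheme it only has to decide whether a syllable is emphasized at all, which is the simpler prediction task the abstract refers to.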
