Synthesizing sports commentaries: One or several emphatic stresses?

Emphatic stresses are known to fulfill essential functions in expressive speech. Their integration in speech synthesis usually relies on a prosodic annotation of the training corpus, in which emphasized syllables are either assigned a single label or receive one of several labels according to their acoustic realization. While these multiple labels are harder to predict for new text to be synthesized, they might allow for a better rendering of the stress in the synthesized speech. This paper examines whether the use of more than one emphatic label improves the perceived expressivity of the synthesized speech. It relies on a manually annotated expressive corpus of sports commentaries. Statistical acoustic analyses show that four distinct realizations of emphatic stress can be distinguished. However, perceptual tests indicate that integrating this distinction into HMM-based speech synthesis does not lead to a significant improvement in expressivity. This suggests that the different acoustic realizations of the stress do not need to be explicitly annotated in the training corpus.
