A new prosody annotation protocol for live sports commentaries

This paper proposes a new prosody annotation protocol specific to live sports commentaries. Two levels of annotation are defined with HMM-based speech synthesis in view. Local labels are assigned to every syllable and describe accentual phenomena. Global labels classify sequences of words into five distinct subgenres, defined in terms of valence and arousal. The objective of the study is to provide a set of labels that are each related to a specific function and characterized by a distinct acoustic realization. Meeting these two constraints should allow the labels to be predicted automatically from either the text or the speech signal. Reasonable inter-annotator agreement scores are achieved for both annotation levels. A prosodic analysis of all labels also shows that they can usually be distinguished by specific acoustic realizations. The integration of this new annotation protocol within HMM-based speech synthesis shows promising results.
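As an illustration of how agreement between two annotators on nominal labels can be quantified, the sketch below computes Cohen's kappa, a common chance-corrected agreement measure. The abstract does not state which measure was used, so this is an assumption. The label names (`accented`, `none`) and the two annotation sequences are hypothetical and are not taken from the paper's corpus or label set.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators on nominal labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: proportion of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement: chance overlap given each annotator's label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical label throughout; kappa is undefined,
        # so agreement is reported as perfect by convention in this sketch.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical syllable-level labels from two annotators (illustration only).
annotator_1 = ["accented", "none", "none", "accented", "none", "none"]
annotator_2 = ["accented", "none", "accented", "accented", "none", "none"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.3f}")
```

The same function applies unchanged to the global level, where each item would be a sequence of words labeled with one of the five subgenres.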
