Perception of prosodic variation for speech synthesis using an unsupervised discrete representation of F0

In English, prosody adds a broad range of information to segment sequences, from information structure (e.g. contrast) to stylistic variation (e.g. expression of emotion). However, when learning to control prosody in text-to-speech voices, it is not clear what exactly the control is modifying. Existing research on discrete representation learning for prosody has demonstrated high naturalness, but no analysis has been performed on what these representations capture, or whether they can generate meaningfully distinct variants of an utterance. We present a phrase-level variational autoencoder with a multi-modal prior, using the mode centres as "intonation codes". Our evaluation establishes which intonation codes are perceptually distinct, finding that the intonation codes from our multi-modal latent model were significantly more distinct than a baseline using k-means clustering. We carry out a follow-up qualitative study to determine what information the codes carry. Most commonly, listeners commented on the intonation codes having a statement or question style. However, many other affect-related styles were also reported, including emotional, uncertain, surprised, sarcastic, passive-aggressive, and upset.
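As an illustration of the modelling idea rather than the paper's implementation, the sketch below shows a phrase-level F0 VAE with a learnable Gaussian-mixture ("multi-modal") prior whose component means serve as the "intonation codes"; decoding from a mode centre instead of a posterior sample yields one discrete intonation variant. All class names, layer choices (GRU encoder/decoder), dimensions, and hyperparameters are illustrative assumptions. The k-means baseline described above would instead cluster posterior means and use the cluster centroids as codes.

```python
import math

import torch
import torch.nn as nn


class MixturePriorVAE(nn.Module):
    """Sketch: phrase-level F0 VAE with a learnable Gaussian-mixture prior.

    The K component means (`prior_means`) play the role of "intonation codes";
    decoding from a mode centre rather than a posterior sample gives one
    discrete intonation variant of the phrase.
    """

    def __init__(self, frame_dim=1, latent_dim=16, n_modes=10, hidden=128):
        super().__init__()
        # Encoder: summarise a phrase of F0 frames into q(z | x).
        self.encoder = nn.GRU(frame_dim, hidden, batch_first=True)
        self.to_mu = nn.Linear(hidden, latent_dim)
        self.to_logvar = nn.Linear(hidden, latent_dim)
        # Decoder: reconstruct the F0 contour from z (text/linguistic
        # conditioning omitted for brevity).
        self.decoder = nn.GRU(latent_dim, hidden, batch_first=True)
        self.to_f0 = nn.Linear(hidden, frame_dim)
        # Multi-modal prior: uniform-weight Gaussian mixture with learnable
        # component means and diagonal variances.
        self.prior_means = nn.Parameter(torch.randn(n_modes, latent_dim))
        self.prior_logvars = nn.Parameter(torch.zeros(n_modes, latent_dim))

    def forward(self, f0_frames):
        # f0_frames: (batch, time, frame_dim), e.g. interpolated log-F0.
        _, h = self.encoder(f0_frames)
        mu, logvar = self.to_mu(h[-1]), self.to_logvar(h[-1])
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterise
        recon = self.decode(z, f0_frames.size(1))
        return recon, mu, logvar, z

    def decode(self, z, n_frames):
        # Broadcast the phrase-level latent across all frames of the contour.
        dec_in = z.unsqueeze(1).repeat(1, n_frames, 1)
        dec_out, _ = self.decoder(dec_in)
        return self.to_f0(dec_out)

    def kl_to_mixture(self, z, mu, logvar):
        # Single-sample Monte-Carlo estimate of KL(q(z|x) || p(z)),
        # where p(z) is the uniform-weight Gaussian mixture prior.
        log2pi = math.log(2.0 * math.pi)
        log_q = (-0.5 * (logvar + (z - mu) ** 2 / logvar.exp() + log2pi)).sum(-1)
        diff = z.unsqueeze(1) - self.prior_means.unsqueeze(0)          # (B, K, D)
        log_comp = (-0.5 * (self.prior_logvars + diff ** 2
                            / self.prior_logvars.exp() + log2pi)).sum(-1)
        log_p = torch.logsumexp(log_comp, dim=1) - math.log(self.prior_means.size(0))
        return (log_q - log_p).mean()


# Generating a discrete variant: decode from intonation code k instead of sampling.
# model = MixturePriorVAE()
# f0_variant = model.decode(model.prior_means[2:3], n_frames=200)
```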
