Exploring emotional prototypes in a high-dimensional TTS latent space

Recent TTS systems can generate prosodically varied and realistic speech. However, it remains unclear how this prosodic variation contributes to the perception of speakers’ emotional states. Here we use the recent psychological paradigm ‘Gibbs Sampling with People’ to search the prosodic latent space of a trained GST Tacotron model for prototypes of emotional prosody. Participants recruited online collectively manipulate the latent space of the generative speech model in a sequentially adaptive way, so that the stimulus presented to one group of participants is determined by the responses of the previous groups. We demonstrate that (1) particular regions of the model’s latent space are reliably associated with particular emotions, (2) the resulting emotional prototypes are well recognized by a separate group of human raters, and (3) these emotional prototypes can be effectively transferred to new sentences. Collectively, these experiments demonstrate a novel approach to understanding emotional speech by providing a tool to explore the relation between the latent space of generative models and human semantics.
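
To make the sequential procedure concrete, below is a minimal Python sketch of one such chain. All names here (`synthesize`, `simulated_participant`, `gsp_chain`) are hypothetical illustrations rather than the authors' code: a simulated rater stands in for the human participants, and an identity function stands in for GST Tacotron synthesis.

```python
import numpy as np

# Minimal sketch of a 'Gibbs Sampling with People' (GSP) chain over a TTS
# latent space. In the real experiment, `synthesize` would condition GST
# Tacotron on a style embedding and vocode the text, and each slider setting
# would come from a human listener rather than a simulated one.

def synthesize(latent, text):
    """Placeholder TTS: stands in for rendering `text` with GST Tacotron
    conditioned on the style vector `latent`."""
    return latent  # identity stand-in so the sketch runs end to end

def simulated_participant(render, low, high, target_audio, n_steps=25):
    """Stand-in for a human rater: picks the slider value whose rendering is
    closest to a target stimulus (a human would instead judge which rendering
    best matches the target emotion, e.g. 'angry')."""
    candidates = np.linspace(low, high, n_steps)
    return min(candidates, key=lambda v: np.linalg.norm(render(v) - target_audio))

def gsp_chain(text, n_dims=10, n_iters=50, low=-1.0, high=1.0, seed=0):
    """One GSP chain: at each step, a new 'participant' resamples a single
    latent dimension conditioned on the current values of all the others."""
    rng = np.random.default_rng(seed)
    target = rng.uniform(low, high, n_dims)    # unknown prototype to recover
    target_audio = synthesize(target, text)
    latent = rng.uniform(low, high, n_dims)    # random starting point
    for t in range(n_iters):
        dim = t % n_dims                       # cycle through dimensions
        latent[dim] = simulated_participant(
            render=lambda v: synthesize(
                np.where(np.arange(n_dims) == dim, v, latent), text),
            low=low, high=high, target_audio=target_audio)
    return latent                              # converges toward the prototype

if __name__ == "__main__":
    print(gsp_chain("The quick brown fox jumps over the lazy dog."))
```

Each simulated participant plays the role of one conditional Gibbs update: with all other dimensions held fixed, the chosen slider value acts as a sample from the conditional distribution of that dimension given the target emotion.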

[1] Navdeep Jaitly et al. "Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions." ICASSP, 2018.

[2] Aleix M. Martinez et al. "Emotional Expressions Reconsidered: Challenges to Inferring Emotion From Human Facial Movements." Psychological Science in the Public Interest, 2019.

[3] Paul Boersma et al. "Praat: doing phonetics by computer." 2003.

[4] Peter M. C. Harrison et al. "Gibbs Sampling with People." NeurIPS, 2020.

[5] Björn W. Schuller et al. "The Geneva Minimalistic Acoustic Parameter Set (GeMAPS) for Voice Research and Affective Computing." IEEE Transactions on Affective Computing, 2016.

[6] Bryan Catanzaro et al. "Flowtron: An Autoregressive Flow-based Generative Network for Text-to-Speech Synthesis." ICLR, 2021.

[7] Joseph Henrich et al. "The Weirdest People in the World?" Behavioral and Brain Sciences, 2010.

[8] Michael C. Mangini et al. "Making the ineffable explicit: estimating the information employed for face classifications." Cognitive Science, 2004.

[9] Bart de Boer et al. "Introducing Parselmouth: A Python interface to Praat." Journal of Phonetics, 2018.

[10] Björn W. Schuller et al. "Recent developments in openSMILE, the Munich open-source multimedia feature extractor." ACM Multimedia, 2013.

[11] Josh H. McDermott et al. "Headphone screening to facilitate web-based auditory experiments." Attention, Perception, & Psychophysics, 2017.

[12] Yuxuan Wang et al. "Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron." ICML, 2018.

[13] Samy Bengio et al. "Tacotron: Towards End-to-End Speech Synthesis." INTERSPEECH, 2017.

[14] Thomas L. Griffiths et al. "Markov Chain Monte Carlo with People." NIPS, 2007.

[15] "IEEE Recommended Practice for Speech Quality Measurements." IEEE Transactions on Audio and Electroacoustics, 1969.

[16] Ryan Prenger et al. "Mellotron: Multispeaker Expressive Voice Synthesis by Conditioning on Rhythm, Pitch and Global Style Tokens." ICASSP, 2020.

[17] Andrey Anikin et al. "Perceptual and acoustic differences between authentic and acted nonverbal emotional vocalizations." Quarterly Journal of Experimental Psychology, 2017.

[18] Yuxuan Wang et al. "Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis." ICML, 2018.