Estimating Mutual Information in Prosody Representation for Emotional Prosody Transfer in Speech Synthesis

An end-to-end prosody transfer system aims to transfer speech prosody from one speaker to another. One major application is the generation of emotional speech in a new speaker's voice. The end-to-end system uses an intermediate representation of prosody, which encompasses both speaker- and emotion-related information. The present study tackles the problem of estimating the mutual information between emotion- and speaker-related factors in the prosody representation. A mutual information neural estimator (MINE), which can measure the mutual information between a high-dimensional continuous prosody embedding and a discrete speaker/emotion label, is applied. The experimental results show that: 1) the prosody representation generated by the end-to-end system indeed contains both emotion and speaker information; 2) the amount of mutual information depends on the type of acoustic features fed to the reference encoder; 3) normalization of the log F0 feature is very effective in increasing the emotion-related information in the prosody representation; and 4) adversarial learning can be applied to reduce the speaker information in the prosody representation. These results are useful for the further development of optimal and practical emotional prosody transfer systems.
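To make the estimation procedure concrete, the sketch below shows how a MINE-style estimator can be trained on (prosody embedding, label) pairs. MINE maximizes the Donsker-Varadhan lower bound I(X;Y) >= E_{P_XY}[T_theta(x,y)] - log E_{P_X x P_Y}[exp T_theta(x,y)] over a neural "statistics network" T_theta, with the product of marginals approximated by shuffling labels within a mini-batch. This is a minimal sketch under stated assumptions (PyTorch; the names StatisticsNetwork and mine_lower_bound are illustrative), not the authors' exact implementation.

```python
# Minimal MINE sketch (assumption: PyTorch; names are illustrative, not
# from the paper). Estimates I(embedding; label) via the Donsker-Varadhan
# lower bound; labels are shuffled in-batch to sample the product of marginals.
import math
import torch
import torch.nn as nn

class StatisticsNetwork(nn.Module):
    """T_theta(x, y): scores a (prosody embedding, one-hot label) pair."""
    def __init__(self, embed_dim: int, num_labels: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim + num_labels, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x: torch.Tensor, y_onehot: torch.Tensor) -> torch.Tensor:
        # Concatenate embedding and label; return a scalar score per pair.
        return self.net(torch.cat([x, y_onehot], dim=-1)).squeeze(-1)

def mine_lower_bound(T: StatisticsNetwork, x: torch.Tensor,
                     y_onehot: torch.Tensor) -> torch.Tensor:
    """Donsker-Varadhan bound on I(X;Y), estimated on one mini-batch."""
    joint = T(x, y_onehot).mean()                      # E_{P_XY}[T]
    y_shuffled = y_onehot[torch.randperm(x.size(0))]   # approximates P_X x P_Y
    log_mean_exp = torch.logsumexp(T(x, y_shuffled), dim=0) - math.log(x.size(0))
    return joint - log_mean_exp                        # maximize w.r.t. theta

# Training ascends the bound; the converged value estimates the MI in nats:
#   T = StatisticsNetwork(embed_dim=128, num_labels=4)
#   opt = torch.optim.Adam(T.parameters(), lr=1e-4)
#   loss = -mine_lower_bound(T, prosody_embeddings, emotion_onehot)
#   opt.zero_grad(); loss.backward(); opt.step()
```

Because the label is discrete, one-hot encoding lets the same network handle either the speaker or the emotion factor; a higher converged bound indicates more of that factor is present in the prosody embedding.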
