Interactive Text-to-Speech via Semi-supervised Style Transfer Learning

With growing interest in interactive speech systems, speech emotion recognition and multi-style text-to-speech (TTS) synthesis have become increasingly important research areas. In this paper, we combine the two. We present a method to extract speech style embeddings from input speech queries and apply these embeddings as conditional input to a TTS voice so that the TTS response matches the speaking style of the input query. Specifically, we first train a multi-modal style classification model using acoustic and textual features of speech utterances. Because labeled data are limited, we combine an emotion recognition dataset, the Interactive Emotional Dyadic Motion Capture database (IEMOCAP), with a small labeled subset of our internal TTS dataset for style model training. We take the softmax layer of the style classifier as the style embedding and then apply this style embedding extraction model to generate soft style labels for our unlabeled internal TTS dataset. With this semi-supervised approach, reliable style embeddings are extracted to train a multi-style TTS system. As a result, we obtain a controllable multi-style TTS system whose response matches a given target style embedding, which can be extracted from the input query or assigned manually.
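The sketch below illustrates one way the described pipeline could be wired up; it is not the authors' implementation. It assumes PyTorch, an assumed number of style classes (NUM_STYLES), and hypothetical feature dimensions (ACOUSTIC_DIM, TEXT_DIM). A multi-modal classifier fuses acoustic and textual encodings, its softmax posterior serves as the soft style embedding, and that embedding is used to label an unlabeled corpus for multi-style TTS training.

    # Minimal sketch of the semi-supervised style-embedding pipeline (assumptions:
    # PyTorch, hypothetical feature sizes and class count; not the paper's code).
    import torch
    import torch.nn as nn

    NUM_STYLES = 4        # assumed number of style/emotion classes
    ACOUSTIC_DIM = 80     # assumed mel-spectrogram frame size
    TEXT_DIM = 256        # assumed text/word-embedding size


    class MultiModalStyleClassifier(nn.Module):
        """Encodes acoustic and textual features and predicts a style posterior."""

        def __init__(self):
            super().__init__()
            self.audio_rnn = nn.GRU(ACOUSTIC_DIM, 128, batch_first=True)
            self.text_rnn = nn.GRU(TEXT_DIM, 128, batch_first=True)
            self.classifier = nn.Linear(256, NUM_STYLES)

        def forward(self, mel_frames, text_embeddings):
            # mel_frames: (batch, T_audio, ACOUSTIC_DIM)
            # text_embeddings: (batch, T_text, TEXT_DIM)
            _, h_audio = self.audio_rnn(mel_frames)
            _, h_text = self.text_rnn(text_embeddings)
            fused = torch.cat([h_audio[-1], h_text[-1]], dim=-1)
            logits = self.classifier(fused)
            # The softmax posterior doubles as the soft style embedding.
            return torch.softmax(logits, dim=-1)


    def label_unlabeled_corpus(classifier, corpus):
        """Generate soft style labels for an unlabeled TTS corpus."""
        classifier.eval()
        soft_labels = []
        with torch.no_grad():
            for mel_frames, text_embeddings in corpus:
                soft_labels.append(classifier(mel_frames, text_embeddings))
        return soft_labels

During multi-style TTS training, the soft label would then be broadcast along the time axis and concatenated with the text-encoder states before decoding, so the synthesized response follows the target style.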
