Interactive Text-to-Speech via Semi-supervised Style Transfer Learning

With increasing interest in interactive speech systems, speech emotion recognition and multi-style text-to-speech (TTS) synthesis are becoming increasingly important research areas. In this paper, we combine the two. We present a method that extracts a speech style embedding from an input speech query and applies this embedding as a conditional input to a TTS voice, so that the TTS response matches the speaking style of the input query. Specifically, we first train a multi-modal style classification model using acoustic and textual features of speech utterances. Because labeled data is limited, we combine an emotion recognition dataset, the Interactive Emotional Dyadic Motion Capture database (IEMOCAP), with a small labeled subset of our internal TTS dataset for style model training. We take the output of the style classifier's softmax layer as the style embedding, and then apply this style embedding extraction model to generate soft style labels for our unlabeled internal TTS dataset. With this semi-supervised approach, reliable style embeddings are extracted to train a multi-style TTS system. As a result, we developed a controllable multi-style TTS system whose response matches a given target style embedding, which can be either extracted from the input query or manually assigned.
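The soft-labeling step can be illustrated with a minimal sketch. It is not the system described here: the style inventory, the feature vectors, and the single linear layer standing in for the trained multi-modal classifier (`STYLES`, `WEIGHTS`, `BIAS`, `toy_style_logits`) are all hypothetical placeholders. The sketch only shows how a softmax output can serve as a style embedding, either extracted from an utterance or manually assigned.

```python
import math

# Hypothetical style inventory; the actual style classes are not given here.
STYLES = ["neutral", "happy", "sad", "angry"]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_style_logits(acoustic, textual, weights, bias):
    """Stand-in for the trained classifier: one linear layer over the
    concatenated acoustic and textual features (an assumption)."""
    features = list(acoustic) + list(textual)
    return [
        sum(w * f for w, f in zip(row, features)) + b
        for row, b in zip(weights, bias)
    ]


def soft_style_label(acoustic, textual, weights, bias):
    """Softmax output of the classifier, used as the style embedding."""
    return softmax(toy_style_logits(acoustic, textual, weights, bias))


# Hypothetical parameters: one row per style, one column per feature.
WEIGHTS = [
    [0.0, 0.0, 0.0],
    [1.0, 0.5, 0.2],
    [-1.0, 0.3, -0.4],
    [0.8, -0.6, 0.9],
]
BIAS = [0.1, 0.0, 0.0, -0.1]

# Toy stand-ins for unlabeled TTS utterances (2 acoustic + 1 textual feature).
unlabeled = [
    {"id": "utt_001", "acoustic": [0.9, 0.2], "textual": [0.4]},
    {"id": "utt_002", "acoustic": [-0.7, 0.1], "textual": [-0.3]},
]

# Soft style labels that would condition multi-style TTS training.
soft_labels = {
    u["id"]: soft_style_label(u["acoustic"], u["textual"], WEIGHTS, BIAS)
    for u in unlabeled
}

# A manually assigned target style is simply a one-hot vector of the same size.
manual_target = [1.0 if s == "happy" else 0.0 for s in STYLES]
```

Because extracted and manually assigned embeddings share the same dimensionality, either can be passed to the TTS model as the conditioning vector at synthesis time.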

