Principal Style Components: Expressive Style Control and Cross-Speaker Transfer in Neural TTS

We propose a novel semi-supervised technique that enables expressive style control and cross-speaker transfer in neural text-to-speech (TTS) when the available training data contains only a limited amount of labeled expressive speech from a single speaker. The technique builds on the unsupervised learning of a style-related latent space, produced by a previously proposed reference-audio encoding method, which is then transformed by Principal Component Analysis (PCA) into another low-dimensional space. The latter space represents style information in a purified form, disentangled from text- and speaker-related information, and encodings for the expressive styles present in the training data are easily constructed in it. Furthermore, the technique provides control over speech rate, pitch level, and articulation type, which can be used for TTS voice transformation. We present the results of subjective crowd evaluations confirming that the synthesized speech convincingly conveys the desired expressive styles while preserving a high level of quality.
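
To make the described pipeline concrete, here is a minimal sketch in Python using scikit-learn's PCA. The array shapes, the toy data, and the `style_encoding` helper are illustrative assumptions, not the paper's implementation; in practice the embeddings would come from the reference-audio encoder of a trained TTS model.

```python
import numpy as np
from sklearn.decomposition import PCA

# Assumed inputs (hypothetical shapes and names):
#   embeddings:   (N, D) reference-encoder outputs, one per training utterance
#   style_labels: length-N array; most entries are None (unlabeled), a few
#                 carry expressive-style tags from the single labeled speaker
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 128))        # stand-in for encoder outputs
style_labels = np.array([None] * 990 + ["happy"] * 5 + ["sad"] * 5)

# 1) Unsupervised step: PCA maps the style-related latent space into a small
#    number of principal style components.
n_components = 6                                 # assumed dimensionality
pca = PCA(n_components=n_components)
style_coords = pca.fit_transform(embeddings)     # (N, n_components)

# 2) Construct an encoding for a labeled style by averaging the projected
#    coordinates of its few labeled utterances.
def style_encoding(style):
    mask = style_labels == style
    return style_coords[mask].mean(axis=0)       # (n_components,)

happy = style_encoding("happy")

# 3) Map the style encoding back to the original latent space so it can
#    condition the TTS decoder; scaling individual components would adjust
#    properties such as speech rate or pitch level.
conditioning_vector = pca.inverse_transform(happy[None, :])[0]   # (D,)
```

Because the principal components are orthogonal directions of the latent space, a style encoding built this way can be applied to any speaker seen in training, which is one plausible reading of how cross-speaker transfer is achieved.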
