Improving Prosody with Linguistic and BERT-Derived Features in Multi-Speaker Based Mandarin Chinese Neural TTS

Recent advances in neural TTS have made "human parity" synthesized speech possible when a large amount of studio-quality training data from a voice talent is available. With only limited, casual recordings from an ordinary speaker, however, human-like TTS remains a big challenge, and artifacts such as incomplete sentences and repeated words can also appear. Unlike Roman-letter-based languages such as English, written Chinese has no blank space between adjacent words, so word segmentation errors can cause serious semantic confusion and unnatural prosody. In this study, using a multi-speaker TTS framework to compensate for the insufficient training data of a target speaker, we investigate linguistic features and BERT-derived information to improve the prosody of our Mandarin Chinese TTS. Three factors are studied: phone-related and prosody-related linguistic features; breaks predicted more accurately with a refined BERT-CRF model; and a phoneme sequence augmented with character embeddings derived from a BERT model. Subjective tests on in-domain and out-of-domain tasks covering News, Chat, and Audiobook show that all three factors are effective in improving the prosody of our Mandarin TTS. The model with additional BERT character embeddings performs best, outperforming the baseline by a 0.17 MOS gain.
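To make the "augmented phoneme sequence" idea concrete, the sketch below shows one plausible way to align per-character BERT embeddings with a phoneme sequence and concatenate them as encoder inputs. This is a minimal illustration with toy dimensions and a hypothetical `augment_phonemes` helper, not the authors' exact configuration: each Chinese character's embedding is repeated over the phonemes it maps to, then joined with the phoneme embeddings along the feature axis.

```python
import numpy as np

# Illustrative sketch (assumptions): toy embedding sizes, random vectors in
# place of real phoneme embeddings and real BERT character embeddings.
PHONE_DIM = 8   # assumed phoneme-embedding size
CHAR_DIM = 16   # assumed BERT character-embedding size

def augment_phonemes(phone_emb, char_emb, phones_per_char):
    """Repeat each character's embedding over its phonemes, then concatenate.

    phone_emb:       (n_phones, PHONE_DIM) phoneme embeddings
    char_emb:        (n_chars, CHAR_DIM) per-character embeddings, e.g. from BERT
    phones_per_char: phoneme count for each character, in order
    """
    assert sum(phones_per_char) == phone_emb.shape[0]
    # Upsample character embeddings to phoneme resolution.
    upsampled = np.repeat(char_emb, phones_per_char, axis=0)
    # Concatenate along the feature axis to form the augmented input.
    return np.concatenate([phone_emb, upsampled], axis=1)

# Toy example: 2 characters with 2 and 3 phonemes respectively.
phones = np.random.randn(5, PHONE_DIM)
chars = np.random.randn(2, CHAR_DIM)
augmented = augment_phonemes(phones, chars, [2, 3])
print(augmented.shape)  # (5, 24)
```

The character-to-phoneme alignment is assumed to come from the grapheme-to-phoneme front end; the key point is only that character-level semantic context rides along with every phoneme it covers.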
