NoreSpeech: Knowledge Distillation based Conditional Diffusion Model for Noise-robust Expressive TTS

Expressive text-to-speech (TTS) can synthesize a new speaking style by imitating the prosody and timbre of a reference audio, a task that faces the following challenges: (1) the highly dynamic prosody information in the reference audio is difficult to extract, especially when the reference audio contains background noise; (2) the TTS system should generalize well to unseen speaking styles. In this paper, we present a noise-robust expressive TTS model (NoreSpeech), which can robustly transfer the speaking style of a noisy reference utterance to synthesized speech. Specifically, NoreSpeech includes several components: (1) a novel DiffStyle module, which leverages powerful probabilistic denoising diffusion models to learn noise-agnostic speaking-style features from a teacher model by knowledge distillation; (2) a VQ-VAE block, which maps the style features into a controllable quantized latent space to improve the generalization of style transfer; and (3) a straightforward but effective parameter-free text-style alignment module, which enables NoreSpeech to transfer style to a textual input from a length-mismatched reference utterance. Experiments demonstrate that NoreSpeech is more effective than previous expressive TTS models in noisy environments. Audio samples and code are available at: http://dongchaoyang.top/NoreSpeech_demo/

Index Terms: text to speech, style transfer, diffusion model, knowledge distillation, VQ-VAE
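
The DiffStyle description suggests a fairly standard conditional denoising diffusion model trained against a distillation target. Below is a minimal sketch of that reading: a frozen teacher extracts noise-agnostic style features from the clean reference as the diffusion target x0, and a student epsilon-prediction network is conditioned on an encoding of the noisy reference. All module names (`teacher`, `ref_encoder`, `eps_net`), the noise schedule, and the tensor shapes are hypothetical placeholders, not the paper's actual interfaces.

```python
# Sketch of diffusion-based knowledge distillation for style features.
# Assumptions: teacher(clean_ref) -> (B, L, D) style features; eps_net takes
# (x_t, t, cond) and predicts the injected noise. Shapes and names are ours.
import torch
import torch.nn.functional as F

T = 1000                                   # number of diffusion steps
betas = torch.linspace(1e-4, 0.02, T)      # standard linear noise schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

def distill_step(eps_net, teacher, ref_encoder, clean_ref, noisy_ref, opt):
    """One training step: diffuse the teacher's style features and train the
    student to predict the injected Gaussian noise, conditioned on an
    encoding of the *noisy* reference."""
    with torch.no_grad():                  # teacher is frozen during distillation
        x0 = teacher(clean_ref)            # (B, L, D) noise-agnostic style target
    cond = ref_encoder(noisy_ref)          # conditioning from the noisy reference
    t = torch.randint(0, T, (x0.size(0),), device=x0.device)
    a = alphas_bar.to(x0.device)[t].view(-1, 1, 1)
    eps = torch.randn_like(x0)
    x_t = a.sqrt() * x0 + (1.0 - a).sqrt() * eps   # forward diffusion q(x_t | x0)
    loss = F.mse_loss(eps_net(x_t, t, cond), eps)  # epsilon-prediction objective
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```

At inference time the same network would run the reverse diffusion from pure noise, conditioned on the noisy reference, to produce clean style features for the synthesizer.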

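The VQ-VAE block is easier to pin down, since vector quantization in the style of Neural Discrete Representation Learning is well established: each continuous style vector is snapped to its nearest codebook entry, with codebook and commitment losses and a straight-through gradient. A minimal sketch follows; the codebook size, feature dimension, and commitment weight below are placeholder values, not the paper's settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    """Minimal VQ layer: nearest-neighbor codebook lookup with the usual
    codebook + commitment losses and a straight-through gradient copy."""
    def __init__(self, num_codes=512, dim=256, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        self.codebook.weight.data.uniform_(-1 / num_codes, 1 / num_codes)
        self.beta = beta

    def forward(self, z):                            # z: (B, L, D) style features
        flat = z.reshape(-1, z.size(-1))
        d = torch.cdist(flat, self.codebook.weight)  # distances to every code
        idx = d.argmin(dim=1)
        z_q = self.codebook(idx).view_as(z)
        # codebook loss pulls codes toward encodings; commitment loss does
        # the reverse, weighted by beta
        loss = F.mse_loss(z_q, z.detach()) + self.beta * F.mse_loss(z, z_q.detach())
        z_q = z + (z_q - z).detach()                 # straight-through estimator
        return z_q, idx.view(z.shape[:-1]), loss
```

Quantizing the style space in this way constrains the features the synthesizer sees to a fixed discrete inventory, which is what the abstract credits for better generalization to unseen styles.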
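The parameter-free text-style alignment module is described only at a high level. One plausible reading of "parameter-free" is that the style-feature sequence is simply resampled to the text length with no learned weights, for example by linear interpolation along the time axis, as sketched below; the function name and the interpolation choice are our assumptions, and the paper's module may instead use a non-learned attention between text and style features.

```python
import torch
import torch.nn.functional as F

def align_style_to_text(style, text_len):
    """Hypothetical parameter-free alignment: resample a (B, L_ref, D)
    style-feature sequence to the phoneme-sequence length via linear
    interpolation, so a length-mismatched reference can condition every
    text position without any learned alignment parameters."""
    # interpolate over the time axis: (B, D, L_ref) -> (B, D, L_text)
    resized = F.interpolate(style.transpose(1, 2), size=text_len,
                            mode="linear", align_corners=False)
    return resized.transpose(1, 2)                   # (B, L_text, D)
```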