Speech Synthesis and Control Using Differentiable DSP

Modern text-to-speech systems can produce natural, high-quality speech, but speech contains factors of variation (e.g., pitch, rhythm, loudness, timbre) that text alone does not specify. In this work we move towards a speech synthesis system that can produce diverse renditions of a text by allowing, but not requiring, explicit control over these factors of variation. We propose a new neural vocoder that offers such control. This is achieved by employing differentiable digital signal processing (DDSP), previously applied only to music rather than speech, which exposes these factors of variation directly as synthesis parameters. The results show that the proposed approach produces natural speech with realistic timbre, and that the individual factors of variation can be controlled freely and independently.
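To make the core idea concrete, below is a minimal sketch of the central DDSP building block: a differentiable harmonic oscillator bank whose frame-rate controls, fundamental frequency and loudness, are exactly the kinds of factors of variation the proposed vocoder exposes. This is an illustration under stated assumptions, not the paper's implementation: the framework (PyTorch), the function name `harmonic_synth`, and its parameters are all hypothetical, and a full DDSP vocoder would additionally include a filtered-noise component for unvoiced sounds.

```python
import torch

def harmonic_synth(f0, amplitude, harmonic_distribution,
                   sample_rate=16000, hop=64):
    """Differentiable harmonic oscillator bank (a sketch of the DDSP idea).

    f0:                    (frames,) fundamental frequency in Hz
    amplitude:             (frames,) overall loudness envelope
    harmonic_distribution: (frames, n_harmonics) per-harmonic weights
    Returns a waveform of length frames * hop.
    """
    n_harmonics = harmonic_distribution.shape[-1]

    # Upsample frame-rate controls to audio rate by linear interpolation.
    def upsample(x):
        x = x.T.unsqueeze(0)  # (1, channels, frames)
        x = torch.nn.functional.interpolate(
            x, scale_factor=hop, mode="linear", align_corners=False)
        return x.squeeze(0).T  # (samples, channels)

    f0_up = upsample(f0.unsqueeze(-1)).squeeze(-1)          # (samples,)
    amp_up = upsample(amplitude.unsqueeze(-1)).squeeze(-1)  # (samples,)
    harm_up = upsample(harmonic_distribution)               # (samples, n_harmonics)

    # Per-sample frequency of each harmonic; zero out anything above
    # Nyquist to avoid aliasing, then renormalize the distribution.
    ratios = torch.arange(1, n_harmonics + 1)
    freqs = f0_up.unsqueeze(-1) * ratios                    # (samples, n_harmonics)
    harm_up = torch.where(freqs < sample_rate / 2,
                          harm_up, torch.zeros_like(harm_up))
    harm_up = harm_up / (harm_up.sum(-1, keepdim=True) + 1e-8)

    # Integrate instantaneous frequency to phase, then sum the sinusoids.
    phases = 2 * torch.pi * torch.cumsum(freqs / sample_rate, dim=0)
    return (amp_up.unsqueeze(-1) * harm_up * torch.sin(phases)).sum(-1)
```

Because every operation above is differentiable, gradients from a reconstruction loss can flow back into whatever network predicts the f0, amplitude, and harmonic-distribution controls, and at inference time those same controls can be edited directly: for example, scaling f0 transposes pitch while leaving loudness and timbre untouched. This is what makes the factors of variation independently controllable rather than entangled inside a black-box waveform generator.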
