论文信息 - Non-Autoregressive Neural Text-to-Speech

Non-Autoregressive Neural Text-to-Speech

In this work, we propose ParaNet, a non-autoregressive seq2seq model that converts text to spectrogram. It is fully convolutional and brings 46.7 times speed-up over the lightweight Deep Voice 3 at synthesis, while obtaining reasonably good speech quality. ParaNet also produces stable alignment between text and speech on the challenging test sentences by iteratively improving the attention in a layer-by-layer manner. Furthermore, we build the parallel text-to-speech system and test various parallel neural vocoders, which can synthesize speech from text through a single feed-forward pass. We also explore a novel VAE-based approach to train the inverse autoregressive flow (IAF) based parallel vocoder from scratch, which avoids the need for distillation from a separately trained WaveNet as previous work.

Wei Ping | Zhao Song | Kainan Peng | Kexin Zhao

[1] Xu Tan,et al. FastSpeech: Fast, Robust and Controllable Text to Speech , 2019, NeurIPS.

[2] Jason Lee,et al. Deterministic Non-Autoregressive Neural Sequence Modeling by Iterative Refinement , 2018, EMNLP.

[3] Patrick Nguyen,et al. Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis , 2018, NeurIPS.

[4] Rob Fergus,et al. Stochastic Video Generation with a Learned Prior , 2018, ICML.

[5] Wei Ping,et al. WaveFlow: A Compact Flow-based Model for Raw Audio , 2019, ICML.

[6] Oriol Vinyals,et al. Neural Discrete Representation Learning , 2017, NIPS.

[7] Heiga Zen,et al. Hierarchical Generative Modeling for Controllable Speech Synthesis , 2018, ICLR.

[8] Cha Zhang,et al. CROWDMOS: An approach for crowdsourcing mean opinion score studies , 2011, 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[9] Yoshua Bengio,et al. A Recurrent Latent Variable Model for Sequential Data , 2015, NIPS.

[10] Adam Coates,et al. Deep Voice: Real-time Neural Text-to-Speech , 2017, ICML.

[11] Navdeep Jaitly,et al. Natural TTS Synthesis by Conditioning Wavenet on MEL Spectrogram Predictions , 2017, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[12] Heiga Zen,et al. Sample Efficient Adaptive Text-to-Speech , 2018, ICLR.

[13] Wei Ping,et al. ClariNet: Parallel Wave Generation in End-to-End Text-to-Speech , 2018, ICLR.

[14] Samy Bengio,et al. Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks , 2015, NIPS.

[15] Yoshua Bengio,et al. NICE: Non-linear Independent Components Estimation , 2014, ICLR.

[16] Max Welling,et al. Improved Variational Inference with Inverse Autoregressive Flow , 2016, NIPS 2016.

[17] Victor O. K. Li,et al. Non-Autoregressive Neural Machine Translation , 2017, ICLR.

[18] Shakir Mohamed,et al. Variational Inference with Normalizing Flows , 2015, ICML.

[19] Lior Wolf,et al. VoiceLoop: Voice Fitting and Synthesis via a Phonological Loop , 2017, ICLR.

[20] Heiga Zen,et al. Parallel WaveNet: Fast High-Fidelity Speech Synthesis , 2017, ICML.

[21] Samy Bengio,et al. Tacotron: Towards End-to-End Speech Synthesis , 2017, INTERSPEECH.

[22] Alexandre Lacoste,et al. Probability Distillation: A Caveat and Alternatives , 2019, UAI.

[23] Shujie Liu,et al. Neural Speech Synthesis with Transformer Network , 2018, AAAI.

[24] Lior Wolf,et al. Fitting New Speakers Based on a Short Untranscribed Sample , 2018, ICML.

[25] Sercan Ömer Arik,et al. Neural Voice Cloning with a Few Samples , 2018, NeurIPS.

[26] Ryan Prenger,et al. Waveglow: A Flow-based Generative Network for Speech Synthesis , 2018, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[27] Yoshua Bengio,et al. Char2Wav: End-to-End Speech Synthesis , 2017, ICLR.

[28] Daan Wierstra,et al. Stochastic Backpropagation and Approximate Inference in Deep Generative Models , 2014, ICML.

[29] Sungwon Kim,et al. FloWaveNet : A Generative Flow for Raw Audio , 2018, ICML.

[30] Sercan Ömer Arik,et al. Deep Voice 2: Multi-Speaker Neural Text-to-Speech , 2017, NIPS.

[31] Jimmy Ba,et al. Adam: A Method for Stochastic Optimization , 2014, ICLR.

[32] Samy Bengio,et al. Generating Sentences from a Continuous Space , 2015, CoNLL.

[33] Eunwoo Song,et al. Probability density distillation with generative adversarial networks for high-quality parallel waveform generation , 2019, INTERSPEECH.

[34] Yoshua Bengio,et al. Neural Machine Translation by Jointly Learning to Align and Translate , 2014, ICLR.

[35] Prafulla Dhariwal,et al. Glow: Generative Flow with Invertible 1x1 Convolutions , 2018, NeurIPS.

[36] Samy Bengio,et al. Density estimation using Real NVP , 2016, ICLR.

[37] Erich Elsen,et al. Efficient Neural Audio Synthesis , 2018, ICML.

[38] Xin Wang,et al. Neural Source-filter-based Waveform Model for Statistical Parametric Speech Synthesis , 2018, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).