Parallel Tacotron: Non-Autoregressive and Controllable TTS

Although neural end-to-end text-to-speech models can synthesize highly natural speech, there is still room to improve their efficiency and naturalness. This paper proposes a non-autoregressive neural text-to-speech model augmented with a variational autoencoder-based residual encoder. This model, called Parallel Tacotron, is highly parallelizable during both training and inference, allowing efficient synthesis on modern parallel hardware. The variational autoencoder eases the one-to-many mapping inherent in the text-to-speech problem, where the same text can be spoken in many valid ways, and thereby improves naturalness. To further improve naturalness, we use lightweight convolutions, which can efficiently capture local contexts, and introduce an iterative spectrogram loss inspired by iterative refinement. Experimental results show that Parallel Tacotron matches a strong autoregressive baseline in subjective evaluations while significantly reducing inference time.
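The abstract does not spell out the form of the iterative spectrogram loss, so the following is only a minimal sketch of the general idea. It assumes the decoder emits one spectrogram prediction per refinement stage, that each prediction is compared to the target with an L1 distance, and that all stages are weighted equally. The function names `l1_loss` and `iterative_spectrogram_loss` are hypothetical and not taken from the paper.

```python
def l1_loss(pred, target):
    """Mean absolute error between two spectrograms (lists of frames)."""
    total = 0.0
    count = 0
    for pred_frame, target_frame in zip(pred, target):
        for p, t in zip(pred_frame, target_frame):
            total += abs(p - t)
            count += 1
    return total / count


def iterative_spectrogram_loss(stage_predictions, target):
    """Average the L1 loss over every intermediate prediction.

    Assumption: each entry of stage_predictions is the spectrogram
    predicted after one decoder stage, and stages are weighted equally.
    """
    losses = [l1_loss(pred, target) for pred in stage_predictions]
    return sum(losses) / len(losses)


# Toy target: 2 frames with 2 spectrogram bins each.
target = [[1.0, 2.0], [3.0, 4.0]]

# Three hypothetical stage outputs that move closer to the target.
stage_predictions = [
    [[0.0, 0.0], [0.0, 0.0]],
    [[1.0, 1.0], [3.0, 3.0]],
    [[1.0, 2.0], [3.0, 4.0]],
]

print(iterative_spectrogram_loss(stage_predictions, target))
```

Supervising every stage, rather than only the final output, gives earlier stages a direct training signal, which is the intuition borrowed from iterative refinement.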
