End-to-End Adversarial Text-to-Speech

Modern text-to-speech synthesis pipelines typically involve multiple processing stages, each of which is designed or learnt independently of the rest. In this work, we take on the challenging task of learning to synthesise speech from normalised text or phonemes in an end-to-end manner, resulting in models which operate directly on character or phoneme input sequences and produce raw speech audio outputs. Our proposed generator is feed-forward and thus efficient for both training and inference, using a differentiable alignment scheme based on token length prediction. It learns to produce high-fidelity audio through a combination of adversarial feedback and prediction losses that constrain the generated audio to roughly match the ground truth in terms of total duration and mel-spectrogram. To allow the model to capture temporal variation in the generated audio, we employ soft dynamic time warping in the spectrogram-based prediction loss. The resulting model achieves a mean opinion score exceeding 4 on a 5-point scale, comparable to state-of-the-art models that rely on multi-stage training and additional supervision.
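To make the alignment scheme concrete, here is a minimal NumPy sketch of monotonic interpolation driven by token length prediction, in the spirit of the generator described above: predicted per-token lengths are accumulated into token centres, and audio-rate features are obtained by a Gaussian-kernel softmax over those centres. All names (`gaussian_upsample`, `token_reprs`, `token_lengths`) and the kernel width `sigma2` are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def gaussian_upsample(token_reprs, token_lengths, num_output_steps, sigma2=10.0):
    """Differentiable monotonic alignment from predicted token lengths (sketch).

    token_reprs:   (N, D) per-token hidden vectors.
    token_lengths: (N,)   non-negative predicted lengths (output frames per token).
    Returns (num_output_steps, D) audio-rate features.
    """
    ends = np.cumsum(token_lengths)           # cumulative token end positions
    centres = ends - token_lengths / 2.0      # token centre positions
    t = np.arange(num_output_steps)[:, None]  # output time steps, shape (T, 1)
    # Gaussian kernel between every output step and every token centre,
    # normalised with a softmax over tokens: a soft, monotonic interpolation.
    logits = -((t - centres[None, :]) ** 2) / (2.0 * sigma2)      # (T, N)
    weights = np.exp(logits - logits.max(axis=1, keepdims=True))  # stable softmax
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ token_reprs              # (T, D) interpolated features
```

Similarly, a minimal sketch of the soft dynamic time warping recursion used to relax the spectrogram prediction loss: each cell takes a soft minimum over its diagonal, vertical, and horizontal predecessors, so small temporal shifts between generated and ground-truth mel-spectrograms are penalised smoothly rather than forced into a hard frame-by-frame alignment. The `gamma` temperature and `warp` penalty values here are placeholders, not the paper's settings.

```python
def soft_dtw(delta, gamma=0.01, warp=1.0):
    """Soft-DTW discrepancy given a pairwise frame distance matrix (sketch).

    delta: (T1, T2) distances between generated and target spectrogram frames.
    gamma: softmin temperature; warp: penalty for non-diagonal moves.
    """
    T1, T2 = delta.shape
    r = np.full((T1 + 1, T2 + 1), np.inf)
    r[0, 0] = 0.0
    for i in range(1, T1 + 1):
        for j in range(1, T2 + 1):
            candidates = np.array([
                r[i - 1, j - 1],         # diagonal: advance both sequences
                r[i - 1, j] + warp,      # vertical: extra generated frame
                r[i, j - 1] + warp,      # horizontal: extra target frame
            ])
            # softmin_gamma(a) = -gamma * log(sum(exp(-a / gamma))), stabilised
            m = candidates.min()
            softmin = m - gamma * np.log(np.exp(-(candidates - m) / gamma).sum())
            r[i, j] = delta[i - 1, j - 1] + softmin
    return r[T1, T2]  # differentiable alignment cost between the two sequences
```

As `gamma` approaches 0, the softmin approaches the hard minimum and the recursion recovers classical dynamic time warping; using the soft version keeps the spectrogram loss differentiable with respect to the generated audio.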
