High Fidelity Speech Synthesis with Adversarial Networks

Generative adversarial networks (GANs) have seen rapid development in recent years and have led to remarkable improvements in generative modelling of images. However, their application in the audio domain has received limited attention, and autoregressive models such as WaveNet remain the state of the art in generative modelling of audio signals such as human speech. To address this gap, we introduce GAN-TTS, a Generative Adversarial Network for Text-to-Speech. Our architecture is composed of a conditional feed-forward generator that produces raw speech audio and an ensemble of discriminators that operate on random windows of different sizes. The discriminators analyse the audio both in terms of its general realism and in terms of how well it corresponds to the utterance that should be pronounced. To measure the performance of GAN-TTS, we employ both subjective human evaluation (Mean Opinion Score, MOS) and novel quantitative metrics (Fréchet DeepSpeech Distance and Kernel DeepSpeech Distance), which we find to be well correlated with MOS. We show that GAN-TTS is capable of generating high-fidelity speech with naturalness comparable to that of state-of-the-art models and, unlike autoregressive models, is highly parallelisable thanks to its efficient feed-forward generator. Listen to GAN-TTS reading this abstract at this https URL.
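Two of the abstract's technical ideas lend themselves to short sketches. First, the ensemble of random-window discriminators: each discriminator scores a randomly positioned crop of the raw waveform at its own time scale, rather than scoring the whole utterance at once. The sketch below is a minimal illustration, not the paper's implementation; the window sizes and the `sample_random_windows` helper are hypothetical choices made here for clarity.

```python
import numpy as np

# Illustrative window sizes in samples; hypothetical values,
# not taken from the paper's abstract.
WINDOW_SIZES = [240, 480, 960, 1920, 3600]

def sample_random_windows(waveform, window_sizes, rng):
    """Crop one randomly positioned window per size from a raw waveform.

    Each discriminator in the ensemble would receive the crop matching
    its input size, so the ensemble as a whole judges realism (and, for
    the conditional discriminators, linguistic correspondence) at
    several time scales.
    """
    windows = []
    for size in window_sizes:
        start = rng.integers(0, len(waveform) - size + 1)
        windows.append(waveform[start:start + size])
    return windows

rng = np.random.default_rng(0)
speech = rng.standard_normal(24000)  # stand-in for 1 s of audio at 24 kHz
crops = sample_random_windows(speech, WINDOW_SIZES, rng)
```

Second, the Fréchet DeepSpeech Distance compares real and generated audio through the feature space of a speech recognizer, analogously to FID for images. The sketch below assumes feature matrices have already been extracted from a DeepSpeech-style model (the abstract does not specify the extraction details); `frechet_distance` is a hypothetical helper implementing the standard Fréchet distance between Gaussians fitted to the two feature sets.

```python
import numpy as np
from scipy import linalg

def frechet_distance(feats_real, feats_gen):
    """Fréchet distance between Gaussians fitted to two feature sets.

    feats_*: arrays of shape (num_clips, feature_dim), e.g. activations
    of a DeepSpeech-style recognizer on real and generated audio.
    """
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)
    # Matrix square root of the covariance product; keep the real part
    # to discard tiny imaginary components from numerical error.
    covmean = linalg.sqrtm(cov_r @ cov_g).real
    return float(np.sum((mu_r - mu_g) ** 2)
                 + np.trace(cov_r + cov_g - 2.0 * covmean))

# Toy usage with random stand-in features.
feats_real = np.random.default_rng(1).standard_normal((64, 16))
feats_gen = np.random.default_rng(2).standard_normal((64, 16))
print(frechet_distance(feats_real, feats_gen))
```

The companion Kernel DeepSpeech Distance would replace the Gaussian fit with a kernel-based MMD estimate over the same features, avoiding the Gaussian assumption.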
