Chunked Autoregressive GAN for Conditional Waveform Synthesis

Conditional waveform synthesis models learn a distribution of audio waveforms given conditioning information such as text, mel-spectrograms, or MIDI. These systems employ deep generative models that sample the waveform either sequentially (autoregressive) or in parallel (non-autoregressive). Generative adversarial networks (GANs) have become a common choice for non-autoregressive waveform synthesis. However, state-of-the-art GAN-based models produce artifacts when performing mel-spectrogram inversion. In this paper, we demonstrate that these artifacts correspond to the generator's inability to learn accurate pitch and periodicity. We show that simple pitch and periodicity conditioning is insufficient to reduce this error relative to using autoregression. We discuss the inductive bias that autoregression provides for learning the relationship between instantaneous frequency and phase, and show that this inductive bias holds even when autoregressively sampling large chunks of the waveform during each forward pass. Relative to prior state-of-the-art GAN-based models, our proposed model, Chunked Autoregressive GAN (CARGAN), reduces pitch error by 40-60%, reduces training time by 58%, maintains a fast generation speed suitable for real-time or interactive applications, and maintains or improves subjective quality.
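To make the chunking idea concrete, the sketch below shows a chunked autoregressive sampling loop. It is a minimal illustration, not the paper's actual implementation: the generator `model`, its signature, and the chunk, context, and hop sizes are all assumptions chosen for clarity.

```python
# Minimal sketch of chunked autoregressive waveform sampling.
# `model` is a hypothetical generator mapping (mel chunk, previous
# audio context) -> a fixed-size chunk of waveform samples.
import torch

def generate(model, mel, chunk_size=2048, context_size=512, hopsize=256):
    """Generate audio one chunk at a time.

    mel: conditioning mel-spectrogram of shape (1, n_mels, frames).
    Each step conditions on the previous `context_size` samples, which
    is what gives the generator the autoregressive inductive bias for
    relating instantaneous frequency and phase across chunk boundaries.
    """
    total_samples = mel.shape[-1] * hopsize
    # Zero-padded initial context for the first chunk.
    audio = torch.zeros(1, context_size)
    for start in range(0, total_samples, chunk_size):
        # Mel frames aligned with this chunk of samples.
        frames = slice(start // hopsize, (start + chunk_size) // hopsize)
        prev = audio[:, -context_size:]          # autoregressive context
        chunk = model(mel[:, :, frames], prev)   # shape (1, chunk_size)
        audio = torch.cat([audio, chunk], dim=1)
    return audio[:, context_size:]  # strip the initial zero padding
```

Because each forward pass emits thousands of samples rather than one, the loop runs orders of magnitude fewer iterations than sample-level autoregression, which is why generation speed can remain suitable for interactive use.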
