GANSynth: Adversarial Neural Audio Synthesis

Efficient audio synthesis is an inherently difficult machine learning task, as human perception is sensitive to both global structure and fine-scale waveform coherence. Autoregressive models, such as WaveNet, capture local structure well but suffer from slow iterative sampling and lack a global latent structure, while Generative Adversarial Networks (GANs) offer global latent conditioning and efficient parallel sampling, but struggle to generate locally coherent audio waveforms. Herein, we demonstrate that GANs can in fact generate high-fidelity, locally coherent audio by modeling log magnitudes and instantaneous frequencies with sufficient frequency resolution in the spectral domain. Through extensive empirical investigations on the NSynth dataset, we demonstrate that GANs outperform strong WaveNet baselines on automated and human evaluation metrics, and generate audio several orders of magnitude faster than their autoregressive counterparts.
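The spectral representation described above can be illustrated with a minimal numpy sketch: take an STFT of a waveform, keep the log magnitude, and convert phase into instantaneous frequency by unwrapping it along time and taking finite differences between frames. The function name and parameters here are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

def spectral_features(x, n_fft=512, hop=128):
    """Log magnitude and instantaneous frequency from a framed STFT.

    A minimal sketch of the spectral representation described in the
    abstract; `n_fft` and `hop` are illustrative, not the paper's values.
    """
    # Short-time Fourier transform via Hann-windowed, overlapping frames.
    window = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * window
              for i in range(0, len(x) - n_fft + 1, hop)]
    stft = np.fft.rfft(np.stack(frames), axis=-1)

    # Log magnitude, clipped to avoid log(0).
    log_mag = np.log(np.abs(stft) + 1e-6)

    # Instantaneous frequency: unwrap the phase along the time axis,
    # then difference consecutive frames (radians per hop, per bin).
    phase = np.unwrap(np.angle(stft), axis=0)
    inst_freq = np.diff(phase, axis=0)

    return log_mag, inst_freq
```

Unlike raw phase, which wraps to [-pi, pi) and looks noisy, the unwrapped phase derivative is smooth for locally stationary tones, which is what makes it a tractable target for a GAN generator.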
