The challenge of realistic music generation: modelling raw audio at scale

Realistic music generation is a challenging task. Generative models of music learnt from data typically use high-level representations such as scores or MIDI, which abstract away the idiosyncrasies of a particular performance. But these nuances are crucial to our perception of musicality and realism, so in this work we model music directly in the raw audio domain. Autoregressive models have been shown to excel at generating raw audio waveforms of speech, but when applied to music we find them biased towards capturing local signal structure at the expense of modelling long-range correlations. This is problematic because music exhibits structure at many different timescales. In this work, we explore autoregressive discrete autoencoders (ADAs) as a means to enable autoregressive models to capture long-range correlations in waveforms. We find that they allow us to unconditionally generate piano music directly in the raw audio domain, with stylistic consistency across tens of seconds.
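The core idea behind an ADA can be sketched in a few lines: an encoder compresses the waveform into a much shorter sequence of discrete codes, an autoregressive decoder reconstructs the audio conditioned on those codes, and a second autoregressive model over the code sequence then sees far longer temporal context per step. Below is a minimal NumPy sketch of this structure, not the paper's implementation: the random projection, the toy autoregressive filter in decode, and constants such as HOP and CODEBOOK_SIZE are illustrative stand-ins for the learned convolutional encoder and WaveNet-style decoder described in the work.

```python
import numpy as np

rng = np.random.default_rng(0)

HOP = 64               # encoder downsampling factor (illustrative)
CODEBOOK_SIZE = 256    # number of discrete codes (illustrative)
CODE_DIM = 8           # dimensionality of each code vector

codebook = rng.normal(size=(CODEBOOK_SIZE, CODE_DIM))
proj = rng.normal(size=(HOP, CODE_DIM))  # stand-in for a learned encoder

def encode(waveform):
    """Downsample the waveform into frames and quantise each frame to
    the nearest codebook entry (the discrete bottleneck)."""
    n_frames = len(waveform) // HOP
    frames = waveform[: n_frames * HOP].reshape(n_frames, HOP)
    latents = frames @ proj                        # (n_frames, CODE_DIM)
    dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return dists.argmin(axis=1)                    # discrete code indices

def decode(codes):
    """Stand-in for an autoregressive decoder: each output sample depends
    on previous samples plus the upsampled discrete conditioning codes."""
    cond = codebook[codes].repeat(HOP, axis=0)     # codes back to audio rate
    out = np.zeros(len(cond))
    for t in range(1, len(out)):
        out[t] = 0.9 * out[t - 1] + 0.1 * cond[t].mean()  # toy AR update
    return out

# Round-trip one second of toy "audio": the code sequence is HOP times
# shorter, so a second model over the codes covers much longer context.
waveform = rng.normal(size=16000)
codes = encode(waveform)
reconstruction = decode(codes)
print(codes.shape, reconstruction.shape)           # (250,) (16000,)
```

The design point this illustrates is the division of labour: the sample-level decoder only needs to capture local signal structure, while long-range structure is delegated to the model operating on the compressed discrete sequence.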
