A Hierarchical Latent Vector Model for Learning Long-Term Structure in Music

The Variational Autoencoder (VAE) has proven to be an effective model for producing semantically meaningful latent representations for natural data. However, it has thus far seen limited application to sequential data, and, as we demonstrate, existing recurrent VAE models have difficulty modeling sequences with long-term structure. To address this issue, we propose the use of a hierarchical decoder, which first outputs embeddings for subsequences of the input and then uses these embeddings to generate each subsequence independently. This structure encourages the model to utilize its latent code, thereby avoiding the "posterior collapse" problem, which remains an issue for recurrent VAEs. We apply this architecture to modeling sequences of musical notes and find that it exhibits dramatically better sampling, interpolation, and reconstruction performance than a "flat" baseline model. An implementation of our "MusicVAE" is available online at this http URL.

[1]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[2]  Kuldip K. Paliwal,et al.  Bidirectional recurrent neural networks , 1997, IEEE Trans. Signal Process..

[3]  Kenneth Heafield,et al.  KenLM: Faster and Smaller Language Model Queries , 2011, WMT@EMNLP.

[4]  Jeffrey Dean,et al.  Distributed Representations of Words and Phrases and their Compositionality , 2013, NIPS.

[5]  Daan Wierstra,et al.  Stochastic Backpropagation and Approximate Inference in Deep Generative Models , 2014, ICML.

[6]  Yoshua Bengio,et al.  Generative Adversarial Nets , 2014, NIPS.

[7]  Max Welling,et al.  Auto-Encoding Variational Bayes , 2013, ICLR.

[8]  Christian Osendorfer,et al.  Learning Stochastic Recurrent Networks , 2014, NIPS 2014.

[9]  Quoc V. Le,et al.  Sequence to Sequence Learning with Neural Networks , 2014, NIPS.

[10]  Yoshua Bengio,et al.  A Recurrent Latent Variable Model for Sequential Data , 2015, NIPS.

[11]  Quoc V. Le,et al.  Semi-supervised Sequence Learning , 2015, NIPS.

[12]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[13]  Daniel Jurafsky,et al.  A Hierarchical Neural Autoencoder for Paragraphs and Documents , 2015, ACL.

[14]  Diederik P. Kingma,et al.  Variational Recurrent Auto-Encoders , 2014, ICLR.

[15]  Samy Bengio,et al.  Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks , 2015, NIPS.

[16]  Yoshua Bengio,et al.  Neural Machine Translation by Jointly Learning to Align and Translate , 2014, ICLR.

[17]  Alex Graves,et al.  Conditional Image Generation with PixelCNN Decoders , 2016, NIPS.

[18]  Tom White,et al.  Sampling Generative Networks: Notes on a Few Effective Techniques , 2016, ArXiv.

[19]  Daniel P. W. Ellis,et al.  Extracting Ground-Truth Information from MIDI Files: A MIDIfesto , 2016, ISMIR.

[20]  Samy Bengio,et al.  Generating Sentences from a Continuous Space , 2015, CoNLL.

[21]  Colin Raffel,et al.  Learning-Based Methods for Comparing Sequences, with Applications to Audio-to-MIDI Alignment and Matching , 2016 .

[22]  Heiga Zen,et al.  WaveNet: A Generative Model for Raw Audio , 2016, SSW.

[23]  Ole Winther,et al.  Autoencoding beyond pixels using a learned similarity metric , 2015, ICML.

[24]  Ole Winther,et al.  Sequential Neural Models with Stochastic Layers , 2016, NIPS.

[25]  Tom White,et al.  Sampling Generative Networks: Notes on a Few Effective Techniques , 2016, ArXiv.

[26]  Shan Carter,et al.  Using Artificial Intelligence to Augment Human Intelligence , 2017 .

[27]  David Vázquez,et al.  PixelVAE: A Latent Variable Model for Natural Images , 2016, ICLR.

[28]  Pieter Abbeel,et al.  Variational Lossy Autoencoder , 2016, ICLR.

[29]  Joelle Pineau,et al.  A Hierarchical Latent Variable Encoder-Decoder Model for Generating Dialogues , 2016, AAAI.

[30]  Max Welling,et al.  Improved Variational Inference with Inverse Autoregressive Flow , 2016, NIPS 2016.

[31]  Zhiting Hu,et al.  Improved Variational Autoencoders for Text Modeling using Dilated Convolutions , 2017, ICML.

[32]  Erhardt Barth,et al.  A Hybrid Convolutional Variational Autoencoder for Text Generation , 2017, EMNLP.

[33]  Alexander A. Alemi,et al.  An Information-Theoretic Analysis of Deep Latent-Variable Models , 2017, ArXiv.

[34]  Karen Simonyan,et al.  Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders , 2017, ICML.

[35]  Christopher Burgess,et al.  beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework , 2016, ICLR 2016.

[36]  Douglas Eck,et al.  Learning Latent Representations of Music to Generate Interactive Musical Palettes , 2018, IUI Workshops.

[37]  Douglas Eck,et al.  A Neural Representation of Sketch Drawings , 2017, ICLR.

[38]  Xiang Wei,et al.  Improving the Improved Training of Wasserstein GANs: A Consistency Term and Its Dual Effect , 2018, ICLR.

[39]  Jaakko Lehtinen,et al.  Progressive Growing of GANs for Improved Quality, Stability, and Variation , 2017, ICLR.

[40]  Adam Roberts,et al.  Latent Constraints: Learning to Generate Conditionally from Unconditional Generative Models , 2017, ICLR.