Generating Music by Fine-Tuning Recurrent Neural Networks with Reinforcement Learning

Supervised learning with a next-step prediction objective is a common way to train sequence models; however, it suffers from known failure modes, and it is notoriously difficult to train such models to capture certain properties, such as a coherent global structure. Reinforcement learning can be used to impose arbitrary properties on generated data by choosing appropriate reward functions. In this paper we propose a novel approach to sequence training in which we refine a sequence predictor by optimizing for imposed reward functions, while maintaining the good predictive properties learned from data. We propose efficient ways to solve this problem by augmenting deep Q-learning with a cross-entropy reward and by deriving novel off-policy methods for RNNs from stochastic optimal control (SOC). We explore the usefulness of our approach in the context of music generation: an LSTM is trained on a large corpus of songs to predict the next note in a musical sequence, and this Note-RNN is then refined using RL, where the reward function combines rewards based on rules of music theory with the output of another trained Note-RNN. We show that this combination of maximum likelihood (ML) and RL not only produces more pleasing melodies, but also significantly reduces unwanted behaviors and failure modes of the RNN.
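
To make the training signal concrete, here is a minimal, self-contained PyTorch sketch of one Q-learning update with a combined reward of the form the abstract describes: the log-likelihood of the chosen note under a frozen copy of the pre-trained Note-RNN, plus a scaled rule-based music-theory reward. Everything here is an illustrative assumption rather than the paper's exact system: the toy `music_theory_reward` (the paper uses a full set of music-theory rules), the constants `C` and `GAMMA`, the network sizes, and the plain max-target update (the paper builds on a DQN-style learner).

```python
# Sketch of refining a Note-RNN with a combined reward
# r(s, a) = log p(a | s) + r_MT(s, a) / C, under assumed sizes and constants.
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_NOTES = 38   # assumed action-space size (one action per note event)
C = 0.5          # assumed weighting constant for the music-theory reward
GAMMA = 0.99     # discount factor

class NoteRNN(nn.Module):
    """LSTM over one-hot note events; the head serves as next-note logits
    for the reward RNN, or as Q(s, a) values for the Q-network."""
    def __init__(self, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(NUM_NOTES, hidden, batch_first=True)
        self.head = nn.Linear(hidden, NUM_NOTES)

    def forward(self, notes, state=None):
        out, state = self.lstm(notes, state)
        return self.head(out[:, -1]), state  # scores for the next note

def music_theory_reward(prev_note, note):
    """Toy stand-in for the paper's rule-based rewards: penalize repeated notes."""
    return torch.where(note == prev_note, torch.tensor(-1.0), torch.tensor(0.0))

reward_rnn = NoteRNN()   # stands in for a frozen copy of the pre-trained Note-RNN
q_net = NoteRNN()        # Q-network (in the full system, initialized from Note-RNN weights)
target_net = NoteRNN()   # slowly-updated target network
target_net.load_state_dict(q_net.state_dict())
opt = torch.optim.Adam(q_net.parameters(), lr=1e-4)

def one_hot(idx):
    return F.one_hot(idx, NUM_NOTES).float().unsqueeze(1)  # (batch, seq=1, notes)

# One Q-learning step on a single (s, a, s') transition of note indices.
prev_note = torch.tensor([5])
action = torch.tensor([7])

with torch.no_grad():
    # Reward term 1: log-likelihood of the action under the frozen reward RNN.
    logits, _ = reward_rnn(one_hot(prev_note))
    log_p = F.log_softmax(logits, dim=-1).gather(1, action.unsqueeze(1)).squeeze(1)
    # Reward term 2: rule-based music-theory reward, scaled by 1/C.
    r = log_p + music_theory_reward(prev_note, action) / C
    # Bootstrapped target from the target network at the next state.
    next_q, _ = target_net(one_hot(action))
    target = r + GAMMA * next_q.max(dim=-1).values

q_pred, _ = q_net(one_hot(prev_note))
loss = F.mse_loss(q_pred.gather(1, action.unsqueeze(1)).squeeze(1), target)
opt.zero_grad()
loss.backward()
opt.step()
```

In the full system, both the Q-network and the target network would start from the trained Note-RNN weights, so the RL fine-tuning begins at, and stays close to, the supervised solution while the music-theory terms steer it away from the RNN's failure modes.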
