An End to End Model for Automatic Music Generation: Combining Deep Raw and Symbolic Audio Networks

We develop an approach to combining two types of music generation models, namely symbolic and raw audio models. Symbolic models typically operate at the note level and can capture long-term dependencies, but they lack the expressive richness and nuance of performed music. Raw audio models train directly on audio waveforms and can produce expressive music; however, they typically lack structure and long-term dependencies. We describe a work-in-progress system that trains a raw audio model based on the recently proposed WaveNet architecture, but incorporates the notes of the composition as a secondary input to the network. When generating novel compositions, we use an LSTM network whose output feeds into the raw audio model, yielding an end-to-end model that generates raw audio and combines the strengths of both approaches. We describe initial results, which we believe show considerable promise for structured music generation.
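The abstract describes the architecture only at a high level, so the following PyTorch sketch is one plausible reading of it, not the authors' implementation: a WaveNet-style stack of dilated causal convolutions with gated activations, locally conditioned per time step on a piano-roll note signal, paired with an LSTM that models the symbolic note sequence. All class names, the 88-note and 256-level (mu-law) choices, and the channel sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConditionedWaveNetBlock(nn.Module):
    """One dilated causal convolution block with gated activations,
    locally conditioned on a note signal (WaveNet-style conditioning)."""
    def __init__(self, channels, cond_channels, dilation):
        super().__init__()
        self.dilation = dilation
        self.filter_conv = nn.Conv1d(channels, channels, 2, dilation=dilation)
        self.gate_conv = nn.Conv1d(channels, channels, 2, dilation=dilation)
        # 1x1 convolutions project the conditioning signal at each time step.
        self.cond_filter = nn.Conv1d(cond_channels, channels, 1)
        self.cond_gate = nn.Conv1d(cond_channels, channels, 1)
        self.residual = nn.Conv1d(channels, channels, 1)

    def forward(self, x, cond):
        # Left-pad so the convolution stays causal and preserves length.
        pad = (self.dilation, 0)
        f = self.filter_conv(F.pad(x, pad)) + self.cond_filter(cond)
        g = self.gate_conv(F.pad(x, pad)) + self.cond_gate(cond)
        out = torch.tanh(f) * torch.sigmoid(g)
        return x + self.residual(out)  # residual connection

class NoteConditionedWaveNet(nn.Module):
    """Raw audio model whose per-sample prediction is conditioned on the
    notes currently sounding (the 'secondary input' of the abstract)."""
    def __init__(self, quant_levels=256, cond_channels=88,
                 channels=64, n_blocks=10):
        super().__init__()
        self.input = nn.Conv1d(quant_levels, channels, 1)
        # Dilations 1, 2, 4, ..., 512; real WaveNets repeat this cycle
        # several times for a larger receptive field.
        self.blocks = nn.ModuleList(
            ConditionedWaveNetBlock(channels, cond_channels, 2 ** i)
            for i in range(n_blocks))
        self.output = nn.Conv1d(channels, quant_levels, 1)

    def forward(self, audio_onehot, piano_roll):
        # audio_onehot: (batch, 256, T) one-hot mu-law audio samples
        # piano_roll:   (batch, 88, T) note activations at the audio rate
        h = self.input(audio_onehot)
        for block in self.blocks:
            h = block(h, piano_roll)
        return self.output(h)  # per-sample logits over 256 mu-law levels

class NoteLSTM(nn.Module):
    """Symbolic front end: predicts the next piano-roll frame, giving the
    composition its long-term structure."""
    def __init__(self, n_notes=88, hidden=256):
        super().__init__()
        self.lstm = nn.LSTM(n_notes, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, n_notes)

    def forward(self, frames):
        out, _ = self.lstm(frames)  # frames: (batch, steps, 88)
        return self.proj(out)       # logits per note; sample next frame
```

Under these assumptions, generation would proceed in two stages: the NoteLSTM is sampled autoregressively to produce a piano-roll, which is upsampled (e.g., frames repeated) to the audio sample rate and fed as the conditioning input while the WaveNet samples one mu-law level at a time.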
