BassNet: A Variational Gated Autoencoder for Conditional Generation of Bass Guitar Tracks with Learned Interactive Control

Over the past years, deep learning has given a boost to AI-based methods for music creation. An important challenge in this field is to balance user control and autonomy in music generation systems. In this work we present BassNet, a deep learning model for generating bass guitar tracks based on musical source material. An innovative aspect of our work is that the model is trained to learn a temporally stable two-dimensional latent space variable that offers interactive user control. We empirically show that the model can disentangle bass patterns that require sensitivity to harmony, instrument timbre, and rhythm. An ablation study reveals that this capability stems from the temporal stability constraint on latent space trajectories during training. We furthermore demonstrate that models trained on pop/rock music learn a latent space that, among other things, offers control over the diatonic characteristics of the output. Finally, we present and discuss generated bass tracks for three different music fragments. The work presented here is a step toward the integration of AI-based technology into the workflow of musical content creators.
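The abstract does not spell out how the temporal stability constraint is imposed, but the general idea can be illustrated with a minimal sketch: encode the input into a two-dimensional latent trajectory over time and add a penalty on frame-to-frame variation of that trajectory to the usual variational objective. All names (`LatentTrajectoryEncoder`, `temporal_stability_penalty`, `stability_weight`) and the squared-difference form of the penalty are illustrative assumptions, not the authors' exact architecture or loss.

```python
import torch
import torch.nn as nn

class LatentTrajectoryEncoder(nn.Module):
    """Illustrative encoder mapping an input feature sequence to a
    2-D latent trajectory (per-frame mean and log-variance).
    Hypothetical sketch; not the BassNet architecture itself."""

    def __init__(self, n_features: int, hidden: int = 64, latent_dim: int = 2):
        super().__init__()
        self.net = nn.Conv1d(n_features, hidden, kernel_size=3, padding=1)
        self.mu = nn.Conv1d(hidden, latent_dim, kernel_size=1)
        self.logvar = nn.Conv1d(hidden, latent_dim, kernel_size=1)

    def forward(self, x: torch.Tensor):
        # x: (batch, n_features, time)
        h = torch.relu(self.net(x))
        return self.mu(h), self.logvar(h)  # each: (batch, 2, time)

def temporal_stability_penalty(z: torch.Tensor) -> torch.Tensor:
    """Mean squared frame-to-frame difference of the latent trajectory.
    Assumed form of the stability constraint; the paper's formulation
    may differ. z: (batch, latent_dim, time)."""
    return ((z[:, :, 1:] - z[:, :, :-1]) ** 2).mean()

def vae_loss(recon_loss, mu, logvar, z, beta=1.0, stability_weight=0.1):
    # Standard VAE KL term plus the assumed temporal smoothness penalty.
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon_loss + beta * kl + stability_weight * temporal_stability_penalty(z)
```

A penalty of this kind encourages the latent variable to change slowly over time, which is consistent with the abstract's claim that temporal stability is what makes the two-dimensional latent space usable as an interactive control: a user could then steer generation by supplying the 2-D trajectory directly instead of sampling it from the encoder.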
