Encoding Musical Style with Transformer Autoencoders

We consider the problem of learning high-level controls over the global structure of generated sequences, particularly in the context of symbolic music generation with complex language models. In this work, we present the Transformer autoencoder, which aggregates encodings of the input data across time to obtain a global representation of style from a given performance. We show it is possible to combine this global representation with other temporally distributed embeddings, enabling improved control over the separate aspects of performance style and melody. Empirically, we demonstrate the effectiveness of our method on various music generation tasks on the MAESTRO dataset and a YouTube dataset with 10,000+ hours of piano performances, where we achieve improvements in terms of log-likelihood and mean listening scores as compared to baselines.

[1]  Pierre Baldi,et al.  Autoencoders, Unsupervised Learning, and Deep Architectures , 2011, ICML Unsupervised and Transfer Learning.

[2]  Sageev Oore,et al.  Exploring Conditioning for Generative Music Systems with Human-Interpretable Controls , 2019, ICCC.

[3]  Karen Simonyan,et al.  Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders , 2017, ICML.

[4]  Honglak Lee,et al.  Learning Structured Output Representation using Deep Conditional Generative Models , 2015, NIPS.

[5]  Quoc V. Le,et al.  Sequence to Sequence Learning with Neural Networks , 2014, NIPS.

[6]  Kumar Krishna Agrawal,et al.  GANSynth: Adversarial Neural Audio Synthesis , 2019, ICLR.

[7]  Geoffrey E. Hinton Reducing the Dimensionality of Data with Neural , 2008 .

[8]  Max Welling,et al.  Auto-Encoding Variational Bayes , 2013, ICLR.

[9]  Heiga Zen,et al.  WaveNet: A Generative Model for Raw Audio , 2016, SSW.

[10]  Colin Raffel,et al.  A Hierarchical Latent Vector Model for Learning Long-Term Structure in Music , 2018, ICML.

[11]  Ole Winther,et al.  Ladder Variational Autoencoders , 2016, NIPS.

[12]  Douglas Eck,et al.  A Neural Representation of Sketch Drawings , 2017, ICLR.

[13]  Yoshua Bengio,et al.  Extracting and composing robust features with denoising autoencoders , 2008, ICML '08.

[14]  Lukasz Kaiser,et al.  Attention is All you Need , 2017, NIPS.

[15]  Xiaojun Wan,et al.  T-CVAE: Transformer-Based Conditioned Variational Autoencoder for Story Completion , 2019, IJCAI.

[16]  Brian Christopher Smith,et al.  Query by humming: musical information retrieval in an audio database , 1995, MULTIMEDIA '95.

[17]  Yoshua Bengio,et al.  Generative Adversarial Nets , 2014, NIPS.

[18]  Alexei A. Efros,et al.  Image-to-Image Translation with Conditional Adversarial Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[19]  Douglas Eck,et al.  Counterpoint by Convolution , 2019, ISMIR.

[20]  Geoffrey E. Hinton,et al.  Reducing the Dimensionality of Data with Neural Networks , 2006, Science.

[21]  Sergey Levine,et al.  Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks , 2017, ICML.

[22]  Pascal Vincent,et al.  Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion , 2010, J. Mach. Learn. Res..

[23]  Samy Bengio,et al.  Generating Sentences from a Continuous Space , 2015, CoNLL.

[24]  Adam Roberts,et al.  Latent Constraints: Learning to Generate Conditionally from Unconditional Generative Models , 2017, ICLR.

[25]  Samy Bengio,et al.  Discrete Autoencoders for Sequence Models , 2018, ArXiv.

[26]  Yuan Yu,et al.  TensorFlow: A system for large-scale machine learning , 2016, OSDI.

[27]  Lior Wolf,et al.  A Universal Music Translation Network , 2018, ICLR.

[28]  Alexander Lerch,et al.  On the evaluation of generative models in music , 2018, Neural Computing and Applications.

[29]  Andrew M. Dai,et al.  Music Transformer: Generating Music with Long-Term Structure , 2018, ICLR.

[30]  Ashish Vaswani,et al.  Self-Attention with Relative Position Representations , 2018, NAACL.

[31]  Douglas Eck,et al.  Enabling Factorized Piano Music Modeling and Generation with the MAESTRO Dataset , 2018, ICLR.

[32]  Thomas Paine,et al.  Few-shot Autoregressive Density Estimation: Towards Learning to Learn Distributions , 2017, ICLR.

[33]  Alex Graves,et al.  Conditional Image Generation with PixelCNN Decoders , 2016, NIPS.

[34]  Douglas Eck,et al.  This time with feeling: learning expressive musical performance , 2018, Neural Computing and Applications.

[35]  Thierry Bertin-Mahieux,et al.  Automatic Tagging of Audio: The State-of-the-Art , 2011 .

[36]  Douglas Eck,et al.  Learning to Groove with Inverse Sequence Transformations , 2019, ICML.

[37]  Yi-Hsuan Yang,et al.  Improving Automatic Jazz Melody Generation by Transfer Learning Techniques , 2019, 2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC).

[38]  Geoffrey E. Hinton,et al.  Deep Boltzmann Machines , 2009, AISTATS.

[39]  Kilian Q. Weinberger,et al.  ISMIR 2008 – Session 3a – Content-Based Retrieval, Categorization and Similarity 1 LEARNING A METRIC FOR MUSIC SIMILARITY , 2022 .