Compositional Steering of Music Transformers

Musical composition is a combinatorial art in which composers extend sequences by choosing from a vast set of possible feature combinations that give compositions their distinctive qualities. Increasingly, composers are using generative models, such as music transformers, to craft their pieces. Unfortunately, “steering” these models to satisfy a composer's desired qualitative features typically requires retraining, which can be prohibitively expensive; further, existing models are unable to handle arbitrary combinations of features at scale. In this paper we build on lightweight fine-tuning methods, such as prefix tuning and bias tuning, to propose a novel contrastive loss that enables us to steer music transformers over arbitrary combinations of logical features with a relatively small number of extra parameters. We provide both quantitative and qualitative evaluations of our method, which demonstrate its efficacy with respect to existing methods, as well as a case study in which our method was used to compose long-form musical pieces. Musical examples are available for listening online.
