An Improved Relative Self-Attention Mechanism for Transformer with Application to Music Generation

Music relies heavily on self-reference to build structure and meaning. We explore the Transformer architecture [19] as a generative model for music, as self-attention has shown compelling results on tasks that require long-term structure, such as Wikipedia summary generation [24]. However, timing information is critical for polyphonic music, and the Transformer does not explicitly model absolute or relative timing in its structure. To address this, Shaw et al. [26] introduced relative position representations to self-attention to improve machine translation, but their formulation does not scale to longer sequences. We propose an improved formulation that reduces the memory requirement of the relative position computation from O(l²d) to O(ld), where l is the sequence length and d is the hidden size, making it possible to train on much longer sequences and achieve faster convergence. In experiments on symbolic music, we find that relative self-attention substantially improves sample quality for unconditioned generation and can generate sequences longer than those in the training set. When primed with an initial sequence, the model generates continuations that develop the prime coherently and exhibit long-term structure. Relative self-attention can be instrumental in capturing richer relationships within a musical piece.
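To illustrate how such a reduction can be realized, the sketch below shows one way to compute relative position logits with only O(ld) extra intermediate memory: the queries are multiplied against a single (l × d) table of relative embeddings and the result is shifted ("skewed") into position, instead of materializing a separate embedding for every (query, key) pair. This is a minimal NumPy illustration under assumed conventions (embeddings ordered from distance -(l-1) to 0, causal masking); the function names skew and relative_attention are illustrative, not identifiers from the paper's codebase.

import numpy as np

def skew(qe):
    # qe: (L, L) dot products between each query and each of the L relative-position
    # embeddings (column L-1 = distance 0, column 0 = distance -(L-1)).
    # Padding, reshaping, and slicing shifts row i so that entry (i, j) ends up
    # holding the logit for relative distance j - i (valid for j <= i).
    L = qe.shape[-1]
    padded = np.pad(qe, ((0, 0), (1, 0)))   # prepend a dummy zero column -> (L, L+1)
    return padded.reshape(L + 1, L)[1:, :]  # reinterpret as (L+1, L), drop first row

def relative_attention(Q, K, V, E_rel):
    # Q, K, V: (L, D). E_rel: (L, D) relative-position embeddings, one per distance
    # from -(L-1) to 0. No (L, L, D) tensor of per-pair relative embeddings is built,
    # which is where a naive implementation spends O(l^2 d) memory.
    L, D = Q.shape
    logits = (Q @ K.T + skew(Q @ E_rel.T)) / np.sqrt(D)
    logits += np.triu(np.full((L, L), -np.inf), k=1)  # causal mask: attend to j <= i only
    logits -= logits.max(axis=-1, keepdims=True)      # numerically stable softmax
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

# Example usage with random data (single head, L = 8 positions, D = 16 dimensions):
# rng = np.random.default_rng(0)
# Q, K, V, E_rel = (rng.normal(size=(8, 16)) for _ in range(4))
# out = relative_attention(Q, K, V, E_rel)   # shape (8, 16)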

[1] L. Baum, et al. Statistical Inference for Probabilistic Functions of Finite State Markov Chains, 1966.

[2] Geoffrey E. Hinton, et al. Learning representations by back-propagating errors, 1986, Nature.

[3] Paul Smolensky, et al. Information processing in dynamical systems: foundations of harmony theory, 1986.

[4] Jürgen Schmidhuber, et al. Long Short-Term Memory, 1997, Neural Computation.

[5] Bernd Schöner, et al. Analysis and Synthesis of Palestrina-Style Counterpoint Using Markov Chains, 2001, ICMC.

[6] Jürgen Schmidhuber, et al. Finding temporal structure in music: blues improvisation with LSTM recurrent networks, 2002, Proceedings of the 12th IEEE Workshop on Neural Networks for Signal Processing.

[7] Christopher K. I. Williams, et al. Harmonising Chorales by Probabilistic Inference, 2004, NIPS.

[8] Jürgen Schmidhuber, et al. Framewise phoneme classification with bidirectional LSTM and other neural network architectures, 2005, Neural Networks.

[9] Yee Whye Teh, et al. A Fast Learning Algorithm for Deep Belief Nets, 2006, Neural Computation.

[10] Hugo Larochelle, et al. The Neural Autoregressive Distribution Estimator, 2011, AISTATS.

[11] Yoshua Bengio, et al. Modeling Temporal Dependencies in High-Dimensional Sequences: Application to Polyphonic Music Generation and Transcription, 2012, ICML.

[12] Yoshua Bengio, et al. Generative Adversarial Nets, 2014, NIPS.

[13] Hugo Larochelle, et al. A Deep and Tractable Density Estimator, 2013, ICML.

[14] Kratarth Goel, et al. Polyphonic Music Generation by Modeling Temporal Dependencies Using a RNN-DBN, 2014, ICANN.

[15] Jakob Uszkoreit, et al. A Decomposable Attention Model for Natural Language Inference, 2016, EMNLP.

[16] Hugo Larochelle, et al. Neural Autoregressive Distribution Estimation, 2016, J. Mach. Learn. Res.

[17] Gerhard Widmer, et al. Imposing higher-level Structure in Polyphonic Music Generation using Convolutional Restricted Boltzmann Machines and Constraints, 2016, ArXiv.

[18] Gaëtan Hadjeres, et al. Style Imitation and Chord Invention in Polyphonic Music with Exponential Families, 2016, ArXiv.

[19] Lukasz Kaiser, et al. Attention is All you Need, 2017, NIPS.

[20] Yi-Hsuan Yang, et al. MuseGAN: Symbolic-domain Music Generation and Accompaniment with Multi-track Sequential Generative Adversarial Networks, 2017, ArXiv.

[21] Douglas Eck, et al. Counterpoint by Convolution, 2019, ISMIR.

[22] Frank Nielsen, et al. DeepBach: a Steerable Model for Bach Chorales Generation, 2016, ICML.

[23] Douglas Eck, et al. This time with feeling: learning expressive musical performance, 2018, Neural Computing and Applications.

[24] Lukasz Kaiser, et al. Generating Wikipedia by Summarizing Long Sequences, 2018, ICLR.

[25] Ke Li, et al. A Time-Restricted Self-Attention Layer for ASR, 2018, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[26] Ashish Vaswani, et al. Self-Attention with Relative Position Representations, 2018, NAACL.