Visualizing Music Transformer

Like language, music can be represented as a sequence of discrete symbols that form a hierarchical syntax, with notes roughly analogous to characters and motifs of notes to words. Unlike text, music relies heavily on repetition at multiple timescales to build structure and meaning. Music Transformer has shown compelling results in generating music with structure. How does the model capture motifs and build phrasing? In this paper, we introduce a tool for visualizing self-attention on polyphonic music with an interactive pianoroll. We use Music Transformer as both a descriptive tool and a generative model. For the former, we analyze existing music to see whether the resulting self-attention structure corroborates the musical structure known from music theory. For the latter, we inspect the model’s self-attention during generation to understand how past notes affect future ones. As relative self-attention has been shown to be particularly useful for music, we compare and contrast the structure of regular attention with that of relative attention, and examine its impact on the generated music. For example, on the JSB Chorales dataset, a model trained with relative attention attends more consistently to all of the voices in the preceding timestep, to the chords before that, and, at cadences, to the beginning of the phrase, allowing it to create an arc. In contrast, with regular attention we often see the attention “shrink” to focus mainly on the preceding two or three events, resulting in a single voice repeating the same note for a long duration, perhaps due to overconfidence.
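To make the regular-versus-relative comparison concrete, the sketch below computes the per-head attention matrix that such a pianoroll visualization would overlay as arcs from a query note to the past notes it attends to. It is a minimal illustration, not the paper’s implementation: it assumes standard scaled dot-product attention and the relative-position logits of Shaw et al. (2018) as adapted in Music Transformer, uses a direct gather rather than the memory-efficient “skewing” trick, and all function and variable names (`attention_weights`, `E_rel`) are ours.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_weights(Q, K, E_rel=None):
    """Causal attention weights for one head, with optional
    relative-position logits.

    Q, K:   (L, d) query and key matrices.
    E_rel:  (2L-1, d) embeddings for relative distances -(L-1)..L-1,
            or None for regular (content-only) attention.
    Returns the (L, L) attention matrix whose row i gives how much
    event i attends to each past event.
    """
    L, d = Q.shape
    logits = Q @ K.T                                  # content term, (L, L)
    if E_rel is not None:
        # S_rel[i, j] = q_i . e_{j-i}: one extra logit per query/key
        # pair, indexed by their relative distance.
        rel_index = np.arange(L)[None, :] - np.arange(L)[:, None] + (L - 1)
        S_rel = np.einsum('id,ijd->ij', Q, E_rel[rel_index])
        logits = logits + S_rel
    logits = logits / np.sqrt(d)
    # Causal mask: an event may only attend to itself and the past.
    future = np.triu(np.ones((L, L), dtype=bool), k=1)
    logits = np.where(future, -np.inf, logits)
    return softmax(logits, axis=-1)
```

Under this framing, the contrast described above corresponds to comparing `attention_weights(Q, K)` with `attention_weights(Q, K, E_rel)`: in the former, rows may concentrate on the two or three most recent events (the “shrinking” pattern), while the learned distance embeddings in the latter let probability mass also land at fixed musical offsets, such as the previous timestep’s voices or the start of a phrase.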