Learning to Groove with Inverse Sequence Transformations

We explore models for translating abstract musical ideas (scores, rhythms) into expressive performances using Seq2Seq and recurrent Variational Information Bottleneck (VIB) models. Though Seq2Seq models usually require painstakingly aligned corpora, we show that it is possible to adapt an approach from the Generative Adversarial Network (GAN) literature (e.g., Pix2Pix (Isola et al., 2017) and Vid2Vid (Wang et al., 2018)) to sequences, creating large volumes of paired data by performing simple transformations and training generative models to plausibly invert these transformations. Music, and drumming in particular, provides a strong test case for this approach because many common transformations (quantization, removing voices) have clear semantics, and models for learning to invert them have real-world applications. Focusing on the case of drum set players, we create and release a new dataset for this purpose, containing over 13 hours of recordings by professional drummers aligned with fine-grained timing and dynamics information. We also explore some of the creative potential of these models, including demonstrating improvements on state-of-the-art methods for Humanization (instantiating a performance from a musical score).
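To make the inverse-transformation idea concrete, the sketch below is a minimal illustration, not the authors' released code: the DrumHit type, the grid size, and the flattened velocity are assumptions chosen for clarity. It shows how quantizing an expressive drum recording yields an (input, target) pair, where a Seq2Seq model would be trained to map the quantized score back to the original performance.

```python
# Minimal sketch of paired-data creation by inverting a simple transformation.
# All names here (DrumHit, quantize, make_pair) are hypothetical; the grid
# size and fixed velocity are illustrative assumptions.
from dataclasses import dataclass, replace
from typing import List, Tuple

@dataclass(frozen=True)
class DrumHit:
    time: float    # onset time in seconds
    pitch: int     # General MIDI drum pitch (e.g., 36 = kick, 38 = snare)
    velocity: int  # MIDI velocity, 1-127 (dynamics)

def quantize(hits: List[DrumHit], grid: float = 0.125) -> List[DrumHit]:
    """Snap onsets to the nearest grid point and flatten dynamics,
    discarding exactly the microtiming and velocity information
    that the model must then learn to reconstruct."""
    return [replace(h, time=round(h.time / grid) * grid, velocity=100)
            for h in hits]

def make_pair(performance: List[DrumHit]) -> Tuple[List[DrumHit], List[DrumHit]]:
    """(input, target) = (quantized score, original expressive performance)."""
    return quantize(performance), performance

# A short snare pattern played slightly ahead of and behind the grid.
groove = [DrumHit(0.02, 38, 96), DrumHit(0.26, 38, 64), DrumHit(0.51, 38, 110)]
score, target = make_pair(groove)
print(score)   # onsets snapped to the grid, velocities flattened
print(target)  # original timing and dynamics to be recovered
```

The voice-removal transformation mentioned above fits the same template: delete some drum voices from the performance to form the input, and keep the full performance as the target.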

[1] Sang Won Yoon et al. The Effects of Unilateral Tinnitus on Auditory Temporal Resolution: Gaps-In-Noise Performance. Korean Journal of Audiology, 2014.

[2] Yuxuan Wang et al. Semi-supervised Training for Improving Data Efficiency in End-to-End Speech Synthesis. ICASSP, 2019.

[3] Yiannis Demiris et al. A Groovy Virtual Drumming Agent. IVA, 2009.

[4] Douglas Eck et al. This Time with Feeling: Learning Expressive Musical Performance. Neural Computing and Applications, 2018.

[5] Jan Kautz et al. Video-to-Video Synthesis. NeurIPS, 2018.

[6] Quoc V. Le et al. Sequence to Sequence Learning with Neural Networks. NIPS, 2014.

[7] C. Muchnik et al. Minimal Time Interval in Auditory Temporal Resolution. The Journal of Auditory Research, 1985.

[8] Honglak Lee et al. Attribute2Image: Conditional Image Generation from Visual Attributes. ECCV, 2016.

[9] Colin Raffel et al. Thermometer Encoding: One Hot Way To Resist Adversarial Examples. ICLR, 2018.

[10] Colin Raffel et al. A Hierarchical Latent Vector Model for Learning Long-Term Structure in Music. ICML, 2018.

[11] Olivier Senn et al. Groove in Drum Patterns as a Function of Both Rhythmic Properties and Listeners’ Attitudes. PLoS ONE, 2018.

[12] Yann Dauphin et al. Hierarchical Neural Story Generation. ACL, 2018.

[13] Alexander A. Alemi et al. Deep Variational Information Bottleneck. ICLR, 2017.

[14] R. Keller et al. JazzGAN: Improvising with Generative Adversarial Networks. 2018.

[15] Xi Chen et al. PixelCNN++: Improving the PixelCNN with Discretized Logistic Mixture Likelihood and Other Modifications. ICLR, 2017.

[16] Matthew Wright et al. Towards Machine Learning of Expressive Microtiming in Brazilian Drumming. ICMC, 2006.

[17] Yuan Yu et al. TensorFlow: A System for Large-Scale Machine Learning. OSDI, 2016.

[18] Guy Madison et al. Quantifying Microtiming Patterning and Variability in Drum Kit Recordings. 2015.

[19] Jimmy Ba et al. Adam: A Method for Stochastic Optimization. ICLR, 2015.

[20] Colin Raffel et al. Learning a Latent Space of Multitrack Measures. arXiv, 2018.

[21] Yuxuan Wang et al. Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis. ICML, 2018.

[22] Navdeep Jaitly et al. Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions. ICASSP, 2018.

[23] J. Nikhil et al. Temporal Resolution and Active Auditory Discrimination Skill in Vocal Musicians. International Archives of Otorhinolaryngology, 2015.

[24] C. Raphael et al. Modeling Piano Interpretation Using Switching Kalman Filter. ISMIR, 2012.

[25] Alexei A. Efros et al. Image-to-Image Translation with Conditional Adversarial Networks. CVPR, 2017.

[26] Douglas Eck et al. Counterpoint by Convolution. ISMIR, 2017.

[27] Yiannis Demiris et al. Groovy Neural Networks. ECAI, 2008.

[28] Yiannis Demiris et al. Imitating the Groove: Making Drum Machines More Human. 2007.

[29] Yupeng Gu et al. Creating Expressive Piano Performance Using a Low-Dimensional Performance Model. 2013.

[30] Yuxuan Wang et al. Predicting Expressive Speaking Style from Text in End-to-End Speech Synthesis. IEEE Spoken Language Technology Workshop (SLT), 2018.

[31] Douglas Eck et al. Enabling Factorized Piano Music Modeling and Generation with the MAESTRO Dataset. ICLR, 2019.

[32] Geoffrey E. Hinton et al. Regularizing Neural Networks by Penalizing Confident Output Distributions. ICLR, 2017.