Learning source-aware representations of music in a discrete latent space

In recent years, neural-network-based methods have been proposed to generate representations of music, but these representations are not human-readable and are difficult for a human to analyze or edit. To address this issue, we propose a novel method for learning source-aware latent representations of music with a Vector-Quantized Variational Auto-Encoder (VQ-VAE). We train our VQ-VAE to encode an input mixture into a tensor of integers in a discrete latent space, and we design that space to have a decomposed structure that allows humans to manipulate the latent representation in a source-aware manner. We also show that we can generate bass lines by estimating latent vectors in the discrete space.
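The core idea above can be sketched as nearest-neighbour vector quantization with one codebook slice per source. This is a minimal toy illustration, not the paper's implementation: the source names, codebook sizes, and the random "encoder output" are all hypothetical assumptions for the sketch.

```python
import numpy as np

def quantize(z, codebook):
    """Nearest-neighbour vector quantization (the VQ-VAE bottleneck):
    map each latent frame in z (frames, dim) to the index of its closest
    entry in codebook (K, dim)."""
    d = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (frames, K)
    idx = d.argmin(axis=1)          # discrete integer codes
    return idx, codebook[idx]       # codes and their quantized vectors

# Hypothetical "source-aware" decomposition: a separate codebook per source,
# so each source's codes can be inspected, edited, or swapped independently.
rng = np.random.default_rng(0)
sources = ["vocals", "drums", "bass", "other"]
codebooks = {s: rng.normal(size=(8, 4)) for s in sources}

# Stand-in for the encoder's per-source latents of a 5-frame mixture.
latents = {s: rng.normal(size=(5, 4)) for s in sources}
codes = {s: quantize(latents[s], codebooks[s])[0] for s in sources}
```

Here `codes` is a small tensor of integers per source; replacing, say, `codes["bass"]` with codes predicted by a prior model corresponds to generating a new bass line while leaving the other sources untouched.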
