Learning source-aware representations of music in a discrete latent space

In recent years, neural-network-based methods have been proposed to generate representations of music, but these representations are not human-readable and are difficult for a human to analyze or edit. To address this issue, we propose a novel method for learning source-aware latent representations of music with a Vector-Quantized Variational Auto-Encoder (VQ-VAE). We train our VQ-VAE to encode an input mixture into a tensor of integers in a discrete latent space, and we design that space to have a decomposed structure that allows humans to manipulate the latent representation in a source-aware manner. We also show that we can generate bass lines by estimating latent vectors in the discrete space.
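The core idea above can be sketched as nearest-neighbour vector quantization with one codebook slice per source. This is a minimal toy illustration, not the paper's implementation: the source names, codebook sizes, and the random "encoder output" are all hypothetical assumptions for the sketch.

```python
import numpy as np

def quantize(z, codebook):
    """Nearest-neighbour vector quantization (the VQ-VAE bottleneck):
    map each latent frame in z (frames, dim) to the index of its closest
    entry in codebook (K, dim)."""
    d = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (frames, K)
    idx = d.argmin(axis=1)          # discrete integer codes
    return idx, codebook[idx]       # codes and their quantized vectors

# Hypothetical "source-aware" decomposition: a separate codebook per source,
# so each source's codes can be inspected, edited, or swapped independently.
rng = np.random.default_rng(0)
sources = ["vocals", "drums", "bass", "other"]
codebooks = {s: rng.normal(size=(8, 4)) for s in sources}

# Stand-in for the encoder's per-source latents of a 5-frame mixture.
latents = {s: rng.normal(size=(5, 4)) for s in sources}
codes = {s: quantize(latents[s], codebooks[s])[0] for s in sources}
```

Here `codes` is a small tensor of integers per source; replacing, say, `codes["bass"]` with codes predicted by a prior model corresponds to generating a new bass line while leaving the other sources untouched.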
