Singing Voice Separation using Generative Adversarial Networks

In this paper, we propose a novel approach that extends Wasserstein generative adversarial networks (GANs) [3] to separate the singing voice from a mixture signal. We use the mixture signal as a condition for generating the singing voice and apply a U-net-style network to stabilize training of the model. Experiments on the DSD100 dataset show promising results and suggest the potential of GANs for music source separation.
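As a rough illustration of the conditioning idea described above, the following minimal sketch computes Wasserstein-style critic and generator losses where the critic scores a (voice, mixture) pair rather than the voice alone. Everything in it is an assumption made for illustration: the linear `conditional_critic`, the conditioning by concatenation, the toy vectors, and the function names are hypothetical and are not taken from the paper, which uses convolutional networks on audio representations. The Lipschitz constraint that a Wasserstein critic requires (for example weight clipping or a gradient penalty) is omitted here.

```python
from statistics import fmean


def conditional_critic(voice, mixture, weights):
    """Toy linear critic (hypothetical stand-in for a neural network).

    The mixture is supplied as a condition by concatenating it with the
    candidate voice before scoring.
    """
    pair = list(voice) + list(mixture)
    return sum(w * x for w, x in zip(weights, pair))


def critic_loss(real_scores, fake_scores):
    # Wasserstein critic objective, written as a loss to minimise:
    # E[D(generated voice, mixture)] - E[D(real voice, mixture)]
    return fmean(fake_scores) - fmean(real_scores)


def generator_loss(fake_scores):
    # The generator tries to raise the critic's score on its outputs.
    return -fmean(fake_scores)


# Toy data (assumed values, chosen only to make the arithmetic easy to follow).
mixture = [1.0, 1.0]
real_voice = [0.5, 0.75]
generated_voice = [0.25, 0.25]
weights = [1.0, 1.0, 0.5, 0.5]

real_scores = [conditional_critic(real_voice, mixture, weights)]
fake_scores = [conditional_critic(generated_voice, mixture, weights)]

d_loss = critic_loss(real_scores, fake_scores)
g_loss = generator_loss(fake_scores)
```

In a real system the generator would be the U-net-style network mentioned above, mapping the mixture to an estimate of the singing voice, and both losses would be minimised by alternating gradient steps.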

[1] Yoshua Bengio, et al. Generative Adversarial Nets, 2014, NIPS.

[2] Olaf Ronneberger, et al. Invited Talk: U-Net Convolutional Networks for Biomedical Image Segmentation, 2017, Bildverarbeitung für die Medizin.

[3] Alexei A. Efros, et al. Image-to-Image Translation with Conditional Adversarial Networks, 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[4] Léon Bottou, et al. Wasserstein Generative Adversarial Networks, 2017, ICML.

[5] Aaron C. Courville, et al. Improved Training of Wasserstein GANs, 2017, NIPS.

[6] Raymond Y. K. Lau, et al. Least Squares Generative Adversarial Networks, 2016, 2017 IEEE International Conference on Computer Vision (ICCV).

[7] Jacob D. Abernethy, et al. How to Train Your DRAGAN, 2017, ArXiv.

[8] Simon Osindero, et al. Conditional Generative Adversarial Nets, 2014, ArXiv.