Unsupervised Audio Source Separation using Generative Priors

State-of-the-art under-determined audio source separation systems rely on supervised end-to-end training of carefully tailored neural network architectures operating either in the time or the spectral domain. However, these methods are severely limited by their need for expensive source-level labeled data and by their specificity to a given set of sources and mixing process, which demands complete re-training whenever those assumptions change. This strongly motivates unsupervised methods that can leverage recent advances in data-driven modeling and compensate for the lack of labeled data through meaningful priors. To this end, we propose a novel approach to audio source separation based on generative priors trained on the individual sources. Using projected gradient descent optimization, our approach simultaneously searches the source-specific latent spaces to effectively recover the constituent sources. Although the generative priors can be defined directly in the time domain, e.g. WaveGAN, we find that using spectral-domain loss functions in our optimization leads to good-quality source estimates. Our empirical studies on standard spoken-digit and instrument datasets clearly demonstrate the effectiveness of our approach over classical as well as state-of-the-art unsupervised baselines.
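
The procedure described above can be sketched as follows: assuming pretrained per-source waveform generators (e.g., WaveGAN-style models mapping a latent vector to a 1-D signal), we jointly optimize one latent code per source so that the sum of the generated waveforms matches the observed mixture under an STFT-magnitude loss, projecting each latent back onto a plausible range after every gradient step. The function names, hyperparameters, and the simple clamp-based projection below are illustrative assumptions, not the paper's exact algorithm.

```python
# Minimal sketch of latent-space source separation with generative priors.
# Assumptions (not from the paper): pretrained per-source waveform generators
# "generators" (e.g., WaveGAN-style, latent vector -> 1-D signal), an
# STFT-magnitude loss, and a simple clamp as the projection step.
import torch

def spectral_loss(x, y, n_fft=512, hop=128):
    """Squared distance between STFT magnitudes of two waveforms."""
    win = torch.hann_window(n_fft)
    X = torch.stft(x, n_fft, hop_length=hop, window=win, return_complex=True).abs()
    Y = torch.stft(y, n_fft, hop_length=hop, window=win, return_complex=True).abs()
    return torch.mean((X - Y) ** 2)

def separate(mixture, generators, latent_dim=100, steps=1000, lr=0.05):
    """Jointly optimize one latent code per source so the generated sources sum to the mixture."""
    zs = [torch.randn(1, latent_dim, requires_grad=True) for _ in generators]
    opt = torch.optim.Adam(zs, lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        # Current per-source estimates and their sum (the estimated mixture).
        estimates = [G(z).reshape(-1) for G, z in zip(generators, zs)]
        loss = spectral_loss(sum(estimates), mixture)
        loss.backward()
        opt.step()
        # Projection step: keep each latent on the support the prior was trained with.
        with torch.no_grad():
            for z in zs:
                z.clamp_(-1.0, 1.0)
    return [G(z).detach().reshape(-1) for G, z in zip(generators, zs)]
```

In this sketch the only supervision comes from the pretrained priors: the mixture constrains all latent codes jointly through the spectral loss, while each generator constrains its own source estimate. The clamp stands in for whatever projection matches the latent distribution used during GAN training (e.g., uniform in [-1, 1]).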
