CatNet: music source separation system with mix-audio augmentation

Music source separation (MSS) is the task of separating a music piece into individual sources, such as vocals and accompaniment. Recently, neural network based methods have been applied to the MSS problem; they can be categorized into spectrogram-based and time-domain-based methods. However, there has been little research on using the complementary information of spectrogram and time-domain inputs for MSS. In this article, we propose CatNet, a framework for MSS that concatenates a UNet separation branch, which takes the spectrogram as input, with a WavUNet separation branch, which takes the time-domain waveform as input. The system is end-to-end and fully differentiable, because the spectrogram calculation is incorporated into CatNet. In addition, we propose a novel mix-audio data augmentation method that randomly mixes audio segments from the same source to create augmented audio segments for training. Our proposed CatNet MSS system achieves a state-of-the-art vocals separation source-to-distortion ratio (SDR) of 7.54 dB on the MUSDB18 dataset, outperforming MMDenseNet, which achieves 6.57 dB.
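
The mix-audio augmentation described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the choice to sum segments without rescaling, the default of two segments per mix, and sampling without replacement are all assumptions made for illustration. The only property taken from the abstract is that segments drawn from the same source (for example, only vocals) are randomly mixed to form a new training segment.

```python
import numpy as np


def mix_audio_augment(segments, num_mix, rng):
    """Sum `num_mix` randomly chosen segments of ONE source type.

    segments: array of shape (num_segments, num_samples), where every row
        comes from the same source (e.g. all vocals).
    num_mix: how many segments to add together (assumed hyperparameter).
    rng: a numpy.random.Generator.
    """
    # Pick distinct segment indices at random (assumption: no replacement).
    idx = rng.choice(len(segments), size=num_mix, replace=False)
    # Assumption: plain summation, no gain normalisation.
    return segments[idx].sum(axis=0)


def make_training_pair(vocal_segments, accomp_segments, num_mix=2, seed=0):
    """Build one (mixture, vocals, accompaniment) training example."""
    rng = np.random.default_rng(seed)
    vocals = mix_audio_augment(vocal_segments, num_mix, rng)
    accomp = mix_audio_augment(accomp_segments, num_mix, rng)
    # The network input is the sum of the augmented sources.
    mixture = vocals + accomp
    return mixture, vocals, accomp
```

Because each augmented source is built independently, the mixture passed to the separator is still the exact sum of its targets, so the usual supervised training objective applies unchanged.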
