Decoupling Magnitude and Phase Estimation with Deep ResUNet for Music Source Separation

Deep neural network based methods have been successfully applied to music source separation (MSS). They typically learn a mapping from a mixture spectrogram to a set of source spectrograms, all with magnitudes only. This approach has several limitations: 1) its incorrect phase reconstruction degrades performance; 2) it limits mask magnitudes to the range between 0 and 1, whereas we observe that 22% of time-frequency bins in the popular MUSDB18 dataset have ideal ratio mask values above 1; 3) its potential on very deep architectures is under-explored. Our proposed system is designed to overcome these limitations. First, we estimate phases by estimating complex ideal ratio masks (cIRMs), decoupling the estimation of cIRMs into separate magnitude and phase estimations. Second, we extend the separation method to effectively allow mask magnitudes larger than 1. Finally, we propose a residual UNet architecture with up to 143 layers. Our system achieves a state-of-the-art MSS result on the MUSDB18 dataset, notably an SDR of 8.98 dB on vocals, outperforming the previous best of 7.24 dB. The source code is available at: this https URL
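To illustrate the decoupled cIRM idea described above, the sketch below shows how an unbounded mask magnitude and a separately estimated mask phase could be combined and applied to a mixture STFT. This is a minimal NumPy sketch of the general technique, not the paper's implementation; the function name and the choice of normalizing two raw outputs onto the unit circle for the phase are assumptions for illustration.

```python
import numpy as np

def apply_decoupled_cirm(mix_stft, mask_mag, phase_re, phase_im):
    """Apply a complex ratio mask whose magnitude and phase are
    estimated separately (hypothetical helper, for illustration only).

    mix_stft : complex mixture STFT, shape (F, T)
    mask_mag : non-negative mask magnitude, may exceed 1
    phase_re, phase_im : raw real/imaginary outputs for the mask phase
    """
    # Project (phase_re, phase_im) onto the unit circle so the pair
    # represents cos/sin of the mask phase angle only.
    norm = np.sqrt(phase_re**2 + phase_im**2) + 1e-8
    unit_phase = (phase_re + 1j * phase_im) / norm
    # Decoupled cIRM: unbounded magnitude times unit-modulus phase factor.
    cirm = mask_mag * unit_phase
    return cirm * mix_stft
```

Because the magnitude branch is kept separate from the phase branch, the magnitude can be produced by any non-negative activation without an upper bound of 1, which is what allows mask values above 1 at time-frequency bins where a source is stronger than the mixture.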
