Multi-Scale Temporal-Frequency Attention for Music Source Separation

In recent years, deep neural network (DNN) based approaches have achieved state-of-the-art performance for music source separation (MSS). Although previous methods have addressed large receptive field modeling in various ways, the temporal and frequency correlations of music spectrograms, which exhibit repeated patterns, have not been explicitly exploited for the MSS task. In this paper, a temporal-frequency attention module is proposed to model spectrogram correlations along both the temporal and frequency dimensions. Moreover, a multi-scale attention mechanism is proposed to effectively capture these correlations in music signals. Experimental results on the MUSDB18 dataset show that the proposed method outperforms existing state-of-the-art systems, reaching 9.51 dB signal-to-distortion ratio (SDR) on the vocals stem, vocal separation being the primary practical application of MSS.
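As a rough illustration of the idea described above, the sketch below applies self-attention separately along the time axis and the frequency axis of a spectrogram feature map and fuses the two branches. It is a minimal PyTorch sketch under assumed design choices (class name, head count, 1x1-convolution fusion, residual connection); it is not the authors' implementation, and the multi-scale aspect could be obtained by applying such a module to feature maps at several resolutions.

```python
# Minimal sketch of a temporal-frequency attention block for spectrogram features.
# Assumptions: input shaped (batch, channels, freq_bins, time_frames); module and
# parameter names are illustrative, not taken from the paper.
import torch
import torch.nn as nn


class TemporalFrequencyAttention(nn.Module):
    """Self-attention along the temporal and frequency axes, fused with a 1x1 conv."""

    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.temporal_attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.freq_attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, freq_bins, time_frames)
        b, c, f, t = x.shape

        # Temporal branch: attend over time frames, one sequence per frequency bin.
        xt = x.permute(0, 2, 3, 1).reshape(b * f, t, c)
        xt, _ = self.temporal_attn(xt, xt, xt)
        xt = xt.reshape(b, f, t, c).permute(0, 3, 1, 2)

        # Frequency branch: attend over frequency bins, one sequence per time frame.
        xf = x.permute(0, 3, 2, 1).reshape(b * t, f, c)
        xf, _ = self.freq_attn(xf, xf, xf)
        xf = xf.reshape(b, t, f, c).permute(0, 3, 2, 1)

        # Fuse both branches and add a residual connection.
        return x + self.fuse(torch.cat([xt, xf], dim=1))


# Usage example on a dummy spectrogram feature map.
if __name__ == "__main__":
    block = TemporalFrequencyAttention(channels=32, num_heads=4)
    features = torch.randn(2, 32, 128, 256)   # (batch, channels, freq, time)
    out = block(features)
    print(out.shape)                           # torch.Size([2, 32, 128, 256])
```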
