LaSAFT: Latent Source Attentive Frequency Transformation for Conditioned Source Separation

Recent deep-learning approaches have shown that Frequency Transformation (FT) blocks can significantly improve spectrogram-based single-source separation models by capturing frequency patterns. The goal of this paper is to extend the FT block to the multi-source task. We propose the Latent Source Attentive Frequency Transformation (LaSAFT) block to capture source-dependent frequency patterns. We also propose the Gated Point-wise Convolutional Modulation (GPoCM), an extension of Feature-wise Linear Modulation (FiLM), to modulate internal features. By employing these two novel methods, we extend the Conditioned-U-Net (CUNet) for multi-source separation, and the experimental results indicate that our LaSAFT and GPoCM improve the CUNet’s performance, achieving state-of-the-art SDR performance on several MUSDB18 source separation tasks.
