LaSAFT: Latent Source Attentive Frequency Transformation for Conditioned Source Separation

Recent deep-learning approaches have shown that Frequency Transformation (FT) blocks can significantly improve spectrogram-based single-source separation models by capturing frequency patterns. The goal of this paper is to extend the FT block to the multi-source task. We propose the Latent Source Attentive Frequency Transformation (LaSAFT) block to capture source-dependent frequency patterns. We also propose the Gated Point-wise Convolutional Modulation (GPoCM), an extension of Feature-wise Linear Modulation (FiLM), to modulate internal features. By employing these two novel methods, we extend the Conditioned-U-Net (CUNet) for multi-source separation, and the experimental results indicate that our LaSAFT and GPoCM improve the CUNet’s performance, achieving state-of-the-art SDR performance on several MUSDB18 source separation tasks.
