MaD TwinNet: Masker-Denoiser Architecture with Twin Networks for Monaural Sound Source Separation

The task of monaural singing voice separation is to predict the singing voice from a single-channel music mixture signal. Current state-of-the-art (SOTA) results in monaural singing voice separation are obtained with deep learning based methods. In this work we present a novel recurrent neural approach that learns long-term temporal patterns and structures of a musical piece. We build upon the recently proposed Masker-Denoiser (MaD) architecture and enhance it with Twin Networks, a technique that regularizes a recurrent generative network using a backward-running copy of the network. We evaluate our method on the Demixing Secret Dataset and obtain an increase in signal-to-distortion ratio (SDR) of 0.37 dB and in signal-to-interference ratio (SIR) of 0.23 dB over previous SOTA results.
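
To make the Twin Networks idea concrete, below is a minimal PyTorch sketch (not the paper's implementation): a forward recurrent masker is co-trained with a backward-running twin on the same separation task, and an affine map ties the forward hidden states to the detached backward states. All layer names, feature sizes, and the twin-loss weight are illustrative assumptions.

```python
# Minimal sketch of Twin Networks regularization for a recurrent masker,
# assuming a PyTorch setup. Names, sizes, and the 0.5 twin-loss weight
# are illustrative assumptions, not the paper's exact configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwinRegularizedMasker(nn.Module):
    def __init__(self, n_freq=1025, n_hidden=512):
        super().__init__()
        self.fwd_rnn = nn.GRU(n_freq, n_hidden, batch_first=True)  # forward network
        self.bwd_rnn = nn.GRU(n_freq, n_hidden, batch_first=True)  # backward twin
        self.affine = nn.Linear(n_hidden, n_hidden)  # maps forward to backward states
        self.fwd_mask = nn.Linear(n_hidden, n_freq)  # T-F mask from forward states
        self.bwd_mask = nn.Linear(n_hidden, n_freq)  # T-F mask from backward states

    def forward(self, mix):
        # mix: (batch, time, freq) magnitude spectrogram of the mixture.
        h_fwd, _ = self.fwd_rnn(mix)
        h_bwd, _ = self.bwd_rnn(torch.flip(mix, dims=[1]))
        h_bwd = torch.flip(h_bwd, dims=[1])  # re-align the backward states in time
        # Twin loss: forward states must predict the backward states; detach()
        # stops the twin term from altering the backward network itself.
        twin_loss = F.mse_loss(self.affine(h_fwd), h_bwd.detach())
        mask_f = torch.sigmoid(self.fwd_mask(h_fwd))
        mask_b = torch.sigmoid(self.bwd_mask(h_bwd))
        return mask_f, mask_b, twin_loss

# Toy usage: both networks are trained on the separation task, while the
# weighted twin loss regularizes only the forward network.
model = TwinRegularizedMasker()
mix = torch.rand(4, 60, 1025)     # dummy mixture magnitudes
voice = torch.rand(4, 60, 1025)   # dummy clean-voice target
mask_f, mask_b, twin_loss = model(mix)
loss = (F.l1_loss(mask_f * mix, voice)
        + F.l1_loss(mask_b * mix, voice)
        + 0.5 * twin_loss)        # assumed regularization weight
loss.backward()
```

The intuition is that hidden states which can anticipate the backward twin's view of the future encode longer-term structure of the piece, which is what the masker exploits at test time (the backward twin is discarded after training).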
