MaD TwinNet: Masker-Denoiser Architecture with Twin Networks for Monaural Sound Source Separation

The task of monaural singing voice separation is to predict the singing voice from a single-channel music mixture signal. Current state-of-the-art (SOTA) results in monaural singing voice separation are obtained with deep learning based methods. In this work we present a novel recurrent neural approach that learns long-term temporal patterns and structures of a musical piece. We build upon the recently proposed Masker-Denoiser (MaD) architecture and enhance it with Twin Networks, a technique that regularizes a recurrent generative network using a backward-running copy of the network. We evaluate our method on the Demixing Secret Dataset and obtain an increase in signal-to-distortion ratio (SDR) of 0.37 dB and in signal-to-interference ratio (SIR) of 0.23 dB over previous SOTA results.
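
To make the Twin Networks idea concrete, below is a minimal PyTorch sketch (not the paper's implementation): a forward recurrent masker is co-trained with a backward-running twin on the same separation task, and an affine map ties the forward hidden states to the detached backward states. All layer names, feature sizes, and the twin-loss weight are illustrative assumptions.

```python
# Minimal sketch of Twin Networks regularization for a recurrent masker,
# assuming a PyTorch setup. Names, sizes, and the 0.5 twin-loss weight
# are illustrative assumptions, not the paper's exact configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwinRegularizedMasker(nn.Module):
    def __init__(self, n_freq=1025, n_hidden=512):
        super().__init__()
        self.fwd_rnn = nn.GRU(n_freq, n_hidden, batch_first=True)  # forward network
        self.bwd_rnn = nn.GRU(n_freq, n_hidden, batch_first=True)  # backward twin
        self.affine = nn.Linear(n_hidden, n_hidden)  # maps forward to backward states
        self.fwd_mask = nn.Linear(n_hidden, n_freq)  # T-F mask from forward states
        self.bwd_mask = nn.Linear(n_hidden, n_freq)  # T-F mask from backward states

    def forward(self, mix):
        # mix: (batch, time, freq) magnitude spectrogram of the mixture.
        h_fwd, _ = self.fwd_rnn(mix)
        h_bwd, _ = self.bwd_rnn(torch.flip(mix, dims=[1]))
        h_bwd = torch.flip(h_bwd, dims=[1])  # re-align the backward states in time
        # Twin loss: forward states must predict the backward states; detach()
        # stops the twin term from altering the backward network itself.
        twin_loss = F.mse_loss(self.affine(h_fwd), h_bwd.detach())
        mask_f = torch.sigmoid(self.fwd_mask(h_fwd))
        mask_b = torch.sigmoid(self.bwd_mask(h_bwd))
        return mask_f, mask_b, twin_loss

# Toy usage: both networks are trained on the separation task, while the
# weighted twin loss regularizes only the forward network.
model = TwinRegularizedMasker()
mix = torch.rand(4, 60, 1025)     # dummy mixture magnitudes
voice = torch.rand(4, 60, 1025)   # dummy clean-voice target
mask_f, mask_b, twin_loss = model(mix)
loss = (F.l1_loss(mask_f * mix, voice)
        + F.l1_loss(mask_b * mix, voice)
        + 0.5 * twin_loss)        # assumed regularization weight
loss.backward()
```

The intuition is that hidden states which can anticipate the backward twin's view of the future encode longer-term structure of the piece, which is what the masker exploits at test time (the backward twin is discarded after training).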
