Multi-Resolution Fully Convolutional Neural Networks for Monaural Audio Source Separation

In deep neural networks with convolutional layers, all the neurons in a given layer typically have receptive fields (RFs) of the same size and therefore the same resolution. Layers whose neurons have large RFs capture global information from the input features, while layers whose neurons have small RFs capture local details at high resolution. In this work, we introduce novel deep multi-resolution fully convolutional neural networks (MR-FCN), in which each layer contains neurons with a range of RF sizes, so that the layer extracts multi-resolution features capturing both the global and the local information in its input. The proposed MR-FCN is applied to separating the singing voice from mixtures of music sources. Experimental results show that MR-FCN improves separation performance over feedforward deep neural networks (DNNs) and single-resolution deep fully convolutional neural networks (FCNs) on this audio source separation problem.
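The idea of a layer whose neurons have different RF sizes can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a 1-D input, fixed averaging kernels standing in for learned weights, and illustrative kernel sizes of 3, 7 and 15, none of which are specified in the abstract. The function names `conv1d_same` and `multi_resolution_layer` are hypothetical.

```python
def conv1d_same(signal, kernel):
    """Cross-correlate `signal` with an odd-length `kernel`.

    The input is zero-padded so the output has the same length as the input.
    """
    half = len(kernel) // 2
    padded = [0.0] * half + list(signal) + [0.0] * half
    return [
        sum(k * padded[i + j] for j, k in enumerate(kernel))
        for i in range(len(signal))
    ]


def multi_resolution_layer(signal, kernel_sizes=(3, 7, 15)):
    """Apply one filter per kernel size and return all feature maps.

    Small kernels respond to local detail; large kernels summarise a wider
    context. A following layer would see every map at once.
    """
    feature_maps = []
    for size in kernel_sizes:
        # Assumption: an averaging filter stands in for learned weights.
        kernel = [1.0 / size] * size
        feature_maps.append(conv1d_same(signal, kernel))
    return feature_maps


# A unit impulse makes the receptive field of each branch visible:
# the number of non-zero outputs equals the kernel size of that branch.
signal = [0.0] * 8 + [1.0] + [0.0] * 8
maps = multi_resolution_layer(signal)
supports = [sum(1 for v in fmap if v != 0.0) for fmap in maps]
print(supports)  # [3, 7, 15]
```

In a single-resolution layer all branches would share one kernel size, so every feature map would spread the impulse over the same number of positions.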
