Multi-Scale multi-band densenets for audio source separation

This paper deals with the problem of audio source separation. To handle the complex and ill-posed nature of the problems of audio source separation, the current state-of-the-art approaches employ deep neural networks to obtain instrumental spectra from a mixture. In this study, we propose a novel network architecture that extends the recently developed densely connected convolutional network (DenseNet), which has shown excellent results on image classification tasks. To deal with the specific problem of audio source separation, an up-sampling layer, block skip connection and band-dedicated dense blocks are incorporated on top of DenseNet. The proposed approach takes advantage of long contextual information and outperforms state-of-the-art results on SiSEC 2016 competition by a large margin in terms of signal-to-distortion ratio. Moreover, the proposed architecture requires significantly fewer parameters and considerably less training time compared with other methods.

[1]  Andrew Y. Ng,et al.  Reading Digits in Natural Images with Unsupervised Feature Learning , 2011 .

[2]  Hiroshi Saruwatari,et al.  Multichannel blind source separation based on non-negative tensor factorization in wavenumber domain , 2016, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[3]  Antoine Liutkus,et al.  The 2016 Signal Separation Evaluation Campaign , 2017, LVA/ICA.

[4]  Yoshua Bengio,et al.  Gradient-based learning applied to document recognition , 1998, Proc. IEEE.

[5]  Emmanuel Vincent,et al.  Multichannel music separation with deep neural networks , 2016, 2016 24th European Signal Processing Conference (EUSIPCO).

[6]  Franck Giron,et al.  Deep neural network based instrument extraction from music , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[7]  Jürgen Schmidhuber,et al.  Training Very Deep Networks , 2015, NIPS.

[8]  Antoine Liutkus,et al.  Projection-Based Demixing of Spatial Audio , 2016, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[9]  Matthias Mauch,et al.  MedleyDB: A Multitrack Dataset for Annotation-Intensive MIR Research , 2014, ISMIR.

[10]  Luc Van Gool,et al.  Deep Convolutional Neural Networks and Data Augmentation for Acoustic Event Detection , 2016, ArXiv.

[11]  Antoine Liutkus,et al.  Cauchy nonnegative matrix factorization , 2015, 2015 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA).

[12]  Gerhard Widmer,et al.  A fully convolutional deep auditory model for musical chord recognition , 2016, 2016 IEEE 26th International Workshop on Machine Learning for Signal Processing (MLSP).

[13]  Antoine Liutkus,et al.  Scalable audio separation with light Kernel Additive Modelling , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[14]  Kilian Q. Weinberger,et al.  Densely Connected Convolutional Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[15]  Alex Krizhevsky,et al.  Learning Multiple Layers of Features from Tiny Images , 2009 .

[16]  Brian Kingsbury,et al.  Very deep multilingual convolutional neural networks for LVCSR , 2015, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[17]  Alexey Ozerov,et al.  Multichannel Nonnegative Matrix Factorization in Convolutive Mixtures for Audio Source Separation , 2010, IEEE Transactions on Audio, Speech, and Language Processing.

[18]  Rémi Gribonval,et al.  Under-Determined Reverberant Audio Source Separation Using a Full-Rank Spatial Covariance Model , 2009, IEEE Transactions on Audio, Speech, and Language Processing.

[19]  Franck Giron,et al.  Improving music source separation based on deep neural networks through data augmentation and network blending , 2017, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[20]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[21]  Sergey Ioffe,et al.  Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift , 2015, ICML.

[22]  Antoine Liutkus,et al.  Kernel Additive Models for Source Separation , 2014, IEEE Transactions on Signal Processing.

[23]  Jonathan Le Roux,et al.  Deep NMF for speech separation , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[24]  Dong Yu,et al.  Exploring convolutional neural network structures and optimization techniques for speech recognition , 2013, INTERSPEECH.

[25]  Antoine Liutkus,et al.  PROJET — Spatial audio separation using projections , 2016, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[26]  Luc Van Gool,et al.  AENet: Learning Deep Audio Features for Video Analysis , 2017, IEEE Transactions on Multimedia.

[27]  Yoshua Bengio,et al.  Deep Sparse Rectifier Neural Networks , 2011, AISTATS.

[28]  Rémi Gribonval,et al.  Performance measurement in blind audio source separation , 2006, IEEE Transactions on Audio, Speech, and Language Processing.