Raw Multi-Channel Audio Source Separation using Multi- Resolution Convolutional Auto-Encoders

Supervised multi-channel audio source separation requires extracting useful spectral, temporal, and spatial features from the mixed signals. the success of many existing systems is therefore largely dependent on the choice of features used for training. In this work, we introduce a novel multi-channel, multiresolution convolutional auto-encoder neural network that works on raw time-domain signals to determine appropriate multiresolution features for separating the singing-voice from stereo music. Our experimental results show that the proposed method can achieve multi-channel audio source separation without the need for hand-crafted features or any pre- or post-processing.

[1]  Rémi Gribonval,et al.  Performance measurement in blind audio source separation , 2006, IEEE Transactions on Audio, Speech, and Language Processing.

[2]  Antoine Liutkus,et al.  Scalable audio separation with light Kernel Additive Modelling , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[3]  Timo Gerkmann,et al.  STFT Phase Reconstruction in Voiced Speech for an Improved Single-Channel Speech Enhancement , 2014, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[4]  Yu Tsao,et al.  End-to-End Waveform Utterance Enhancement for Direct Evaluation Metrics Optimization by Fully Convolutional Neural Networks , 2017, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[5]  Bryan Pardo,et al.  REpeating Pattern Extraction Technique (REPET): A Simple Method for Music/Voice Separation , 2013, IEEE Transactions on Audio, Speech, and Language Processing.

[6]  Andrew J. R. Simpson Time-Frequency Trade-offs for Audio Source Separation with Binary Masks , 2015, ArXiv.

[7]  Ron J. Weiss,et al.  Speech acoustic modeling from raw multichannel waveforms , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[8]  Kyogu Lee,et al.  Singing Voice Separation Using RPCA with Weighted l_1 -norm , 2017, LVA/ICA.

[9]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[10]  Paris Smaragdis,et al.  Singing-voice separation from monaural recordings using robust principal component analysis , 2012, 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[11]  Antoine Liutkus,et al.  Common fate model for unison source separation , 2016, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[12]  György Fazekas,et al.  A Tutorial on Deep Learning for Music Information Retrieval , 2017, ArXiv.

[13]  Emmanuel Vincent,et al.  Multichannel Audio Source Separation With Deep Neural Networks , 2016, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[14]  Emmanuel Vincent,et al.  Musical source separation using time-frequency source priors , 2006, IEEE Transactions on Audio, Speech, and Language Processing.

[15]  Mark D. Plumbley,et al.  Single channel audio source separation using convolutional denoising autoencoders , 2017, 2017 IEEE Global Conference on Signal and Information Processing (GlobalSIP).

[16]  Tara N. Sainath,et al.  Learning the speech front-end with raw waveform CLDNNs , 2015, INTERSPEECH.

[17]  Mark D. Plumbley,et al.  Single Channel Audio Source Separation using Deep Neural Network Ensembles , 2016 .

[18]  Tara N. Sainath,et al.  Multichannel Signal Processing With Deep Neural Networks for Automatic Speech Recognition , 2017, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[19]  Yang Yu,et al.  Localization based stereo speech source separation using probabilistic time-frequency masking and deep neural networks , 2016, EURASIP J. Audio Speech Music. Process..

[20]  Garrett T. Kenyon,et al.  Does Phase Matter For Monaural Source Separation? , 2017, ArXiv.

[21]  Emmanuel Vincent,et al.  A General Flexible Framework for the Handling of Prior Information in Audio Source Separation , 2012, IEEE Transactions on Audio, Speech, and Language Processing.

[22]  Benjamin Schrauwen,et al.  End-to-end learning for music audio , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[23]  Yong Xu,et al.  Binaural and log-power spectra features with deep neural networks for speech-noise separation , 2017, 2017 IEEE 19th International Workshop on Multimedia Signal Processing (MMSP).

[24]  Jae S. Lim,et al.  Signal estimation from modified short-time Fourier transform , 1983, ICASSP.

[25]  Wai Lok Woo,et al.  Unsupervised Single-Channel Separation of Nonstationary Signals Using Gammatone Filterbank and Itakura–Saito Nonnegative Matrix Two-Dimensional Factorizations , 2013, IEEE Transactions on Circuits and Systems I: Regular Papers.

[26]  Emilia Gómez,et al.  Monoaural Audio Source Separation Using Deep Convolutional Neural Networks , 2017, LVA/ICA.

[27]  Antoine Liutkus,et al.  The 2016 Signal Separation Evaluation Campaign , 2017, LVA/ICA.

[28]  Gaël Richard,et al.  A Musically Motivated Mid-Level Representation for Pitch Estimation and Musical Audio Source Separation , 2011, IEEE Journal of Selected Topics in Signal Processing.

[29]  Franck Giron,et al.  Improving music source separation based on deep neural networks through data augmentation and network blending , 2017, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[30]  Francesco Visin,et al.  A guide to convolution arithmetic for deep learning , 2016, ArXiv.

[31]  Sepp Hochreiter,et al.  Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs) , 2015, ICLR.

[32]  Jonah Casebeer,et al.  Adaptive Front-ends for End-to-end Source Separation , 2017 .