Improving music source separation based on deep neural networks through data augmentation and network blending

This paper deals with the separation of music into individual instrument tracks, which is known to be a challenging problem. We describe two different deep neural network architectures for this task, a feed-forward and a recurrent one, and show that each of them yields state-of-the-art results on the SiSEC DSD100 dataset. For the recurrent network, we use data augmentation during training and show that even simple separation networks are prone to overfitting if no data augmentation is used. Furthermore, we propose a blending of both neural network systems in which we linearly combine their raw outputs and then apply a multi-channel Wiener filter as post-processing. This blending scheme yields the best results that have been reported to date on the SiSEC DSD100 dataset.
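The blending scheme described above can be illustrated with a minimal sketch. The function below is a simplified, hypothetical stand-in (the function name, the blending weight `lam`, and the array shapes are assumptions, not taken from the paper): it linearly combines the per-source magnitude estimates of two networks and then applies a Wiener-style soft mask to the mixture spectrogram. Note that the paper uses a *multi-channel* Wiener filter, which additionally models spatial covariances; the single-channel mask here only conveys the general idea.

```python
import numpy as np

def blend_and_wiener(est_a, est_b, mix_stft, lam=0.5, eps=1e-10):
    """Blend two sets of per-source magnitude estimates and apply a
    single-channel Wiener-style soft mask (a simplified stand-in for
    the multi-channel Wiener filter post-processing in the paper).

    est_a, est_b : magnitude estimates, shape (n_sources, freq, time)
    mix_stft     : complex mixture STFT, shape (freq, time)
    lam          : blending weight given to the first network's output
    """
    # Linear blend of the two networks' raw outputs
    blended = lam * est_a + (1.0 - lam) * est_b
    # Wiener-style mask: each source's power relative to all sources
    power = blended ** 2
    masks = power / (power.sum(axis=0, keepdims=True) + eps)
    # Apply the masks to the complex mixture to obtain source STFTs
    return masks * mix_stft[None, :, :]
```

Because the masks sum to (approximately) one across sources, the separated source spectrograms sum back to the mixture, which is the conservativity property that makes Wiener-style masking a natural post-processing step.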
