Group delay based music source separation using deep recurrent neural networks

Deep Recurrent Neural Networks (DRNNs) have been used with great success for the challenging task of separating sources from a single-channel acoustic mixture. Conventionally, magnitude spectra are used to learn the characteristics of individual sources in such monaural blind source separation (BSS) tasks, while the phase spectrum, which inherently carries timing information, is ignored. In this work, we explore the use of the modified group delay (MOD-GD) function for learning the time-frequency masks of the sources in the monaural BSS problem. We demonstrate the use of MOD-GD on two music source separation tasks: singing voice separation on the MIR-1K data set and vocal-violin separation on the Carnatic music data set. MOD-GD outperforms the state-of-the-art magnitude-based feature in terms of Signal-to-Interference Ratio (SIR). Moreover, training and testing times are reduced by 50% without compromising performance for the best-performing DRNN configuration.
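The abstract builds on the modified group delay function of Hegde et al. A minimal sketch of computing MOD-GD for one signal frame is shown below. It uses the standard construction (group delay from the spectra of x[n] and n·x[n], a smoothed-spectrum denominator, and power-law compression); the cepstral smoothing of the magnitude spectrum used in the literature is replaced here by the raw magnitude for brevity, and the `alpha`/`gamma` values are illustrative defaults, not the paper's settings.

```python
import numpy as np

def modified_group_delay(x, n_fft=512, alpha=0.4, gamma=0.9, eps=1e-8):
    """Sketch of the modified group delay (MOD-GD) function for one frame.

    tau(w)   = (X_R * Y_R + X_I * Y_I) / S(w)^(2*gamma)
    modgd(w) = sign(tau) * |tau|^alpha

    where X = DFT(x[n]), Y = DFT(n * x[n]), and S is a smoothed magnitude
    spectrum (here approximated by |X| for simplicity).
    """
    n = np.arange(len(x))
    X = np.fft.rfft(x, n_fft)          # spectrum of x[n]
    Y = np.fft.rfft(n * x, n_fft)      # spectrum of n * x[n]
    S = np.abs(X) + eps                # stand-in for the cepstrally smoothed spectrum
    tau = (X.real * Y.real + X.imag * Y.imag) / (S ** (2 * gamma))
    return np.sign(tau) * (np.abs(tau) ** alpha)
```

For feature extraction over a whole signal, this function would be applied per windowed frame, yielding a time-frequency representation analogous to a magnitude spectrogram, on which the mask-learning DRNN can then be trained.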
