Fully complex deep neural network for phase-incorporating monaural source separation

Deep neural networks (DNNs) have become a popular means of separating a target source from a mixed signal. Most DNN-based methods modify only the magnitude spectrum of the mixture; the phase spectrum, which is inherent in the short-time Fourier transform (STFT) coefficients of the input signal, is left unchanged. However, recent studies have revealed that incorporating phase information can improve the quality of separated sources. To estimate the magnitude and the phase of the STFT coefficients simultaneously, this work develops a fully complex-valued deep neural network (FCDNN) that learns the nonlinear mapping from the complex-valued STFT coefficients of a mixture to those of the sources. In addition, to reinforce the sparsity of the estimated spectra, a sparse penalty term is incorporated into the objective function of the FCDNN. Finally, the proposed method is applied to singing voice separation. Experimental results indicate that the proposed method outperforms state-of-the-art DNN-based methods.
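To make the idea concrete, the sketch below shows one fully complex-valued layer operating directly on complex STFT frames, together with an objective that combines a complex-valued reconstruction error with an L1 sparsity penalty on the estimated magnitudes. It is a minimal illustration, not the paper's exact model: the layer sizes, the magnitude-tanh activation, and the penalty weight `lam` are assumptions chosen for readability.

```python
# Minimal sketch of a complex-valued layer and a sparsity-penalized objective.
# Assumptions: layer sizes, the magnitude-tanh activation, and lam are illustrative.
import numpy as np

rng = np.random.default_rng(0)
F, H = 513, 256                      # STFT bins and hidden units (assumed)

# Complex-valued weights and bias of one layer
W = (rng.standard_normal((H, F)) + 1j * rng.standard_normal((H, F))) * 0.01
b = np.zeros(H, dtype=np.complex128)

def complex_activation(z):
    """Squash the magnitude with tanh while preserving the phase."""
    return np.tanh(np.abs(z)) * np.exp(1j * np.angle(z))

def forward(x):
    """One fully complex layer applied to a complex STFT frame of the mixture."""
    return complex_activation(W @ x + b)

def objective(est, ref, lam=0.01):
    """Complex reconstruction error plus an L1 penalty on estimated magnitudes."""
    mse = np.mean(np.abs(est - ref) ** 2)
    sparsity = lam * np.mean(np.abs(est))
    return mse + sparsity

# Toy usage with random complex spectra standing in for mixture / source frames
x_mix = rng.standard_normal(F) + 1j * rng.standard_normal(F)
s_ref = rng.standard_normal(H) + 1j * rng.standard_normal(H)
print(objective(forward(x_mix), s_ref))
```

Training such a network requires gradients with respect to complex parameters, which is typically handled with Wirtinger (CR) calculus as in the complex backpropagation literature; that machinery is omitted here.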
