Improving Reverberant Speech Separation with Binaural Cues Using Temporal Context and Convolutional Neural Networks

Given binaural features as input, such as the interaural level difference (ILD) and interaural phase difference (IPD), Deep Neural Networks (DNNs) have recently been used to localize sound sources in mixtures of speech signals and/or noise, and to create time-frequency masks for estimating the sound sources in reverberant rooms. Here, we explore a more advanced system in which feed-forward DNNs are replaced by Convolutional Neural Networks (CNNs). In addition, frames adjacent to each time frame (occurring both before and after it) are used to exploit contextual information, improving the localization and separation of each source. Separation quality is evaluated in terms of the Signal-to-Distortion Ratio (SDR).
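The two ingredients the abstract names can be sketched in a few lines: computing ILD/IPD features per time-frequency bin from left/right STFTs, and stacking each frame with its neighbours to form a context window. This is a minimal illustrative sketch, not the authors' implementation; the function names, the `eps` floor, and the context size of two frames per side are assumptions.

```python
import numpy as np

def binaural_features(stft_left, stft_right, eps=1e-8):
    """ILD (in dB) and IPD for each time-frequency bin.

    Inputs are complex STFTs of shape (freq_bins, frames);
    eps avoids log/division issues in silent bins (an assumption here).
    """
    ild = 20.0 * np.log10((np.abs(stft_left) + eps) / (np.abs(stft_right) + eps))
    ipd = np.angle(stft_left * np.conj(stft_right))  # wrapped to (-pi, pi]
    return ild, ipd

def stack_context(features, context=2):
    """Concatenate each frame with `context` neighbours on either side.

    Edge frames are padded by repetition; the output has
    (2*context + 1) * dim features per frame, shape preserved along time.
    `features` has shape (dim, frames).
    """
    padded = np.pad(features, ((0, 0), (context, context)), mode="edge")
    frames = features.shape[1]
    return np.vstack([padded[:, i:i + frames] for i in range(2 * context + 1)])
```

In such a pipeline, the stacked ILD/IPD vectors would form the network input for each frame, with the target being a time-frequency mask for each source.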
