Deep Learning Based Binaural Speech Separation in Reverberant Environments

Speech signals are usually degraded by room reverberation and additive noise in real environments. This paper focuses on separating a target speech signal from binaural inputs in reverberant conditions. We formulate binaural separation as a supervised learning problem and employ deep learning to map both spatial and spectral features to a training target. From the binaural inputs, we first apply a fixed beamformer and then extract several spectral features from its output. We also propose a new spatial feature to complement the spectral features. The training target is the recently proposed ideal ratio mask. Systematic evaluations and comparisons show that the proposed system achieves strong separation performance and substantially outperforms related algorithms in challenging multisource and reverberant environments.
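
The ideal ratio mask (IRM) used as the training target is well established in the supervised separation literature: for each time-frequency unit it is the square root of the ratio of target speech energy to total (speech plus interference) energy. Below is a minimal Python sketch of computing the IRM from time-aligned premixed signals and applying an estimated mask at test time. The STFT parameters (16 kHz sampling, 512-sample frames with 50% overlap) and function names are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np
from scipy.signal import stft, istft

def ideal_ratio_mask(clean, interference, fs=16000, nperseg=512, noverlap=256):
    """IRM: per time-frequency unit, sqrt(speech energy / total energy).
    STFT parameters here are illustrative assumptions."""
    _, _, S = stft(clean, fs=fs, nperseg=nperseg, noverlap=noverlap)
    _, _, N = stft(interference, fs=fs, nperseg=nperseg, noverlap=noverlap)
    s_pow = np.abs(S) ** 2
    n_pow = np.abs(N) ** 2
    return np.sqrt(s_pow / (s_pow + n_pow + 1e-12))  # epsilon avoids 0/0

def apply_mask(mixture, mask, fs=16000, nperseg=512, noverlap=256):
    """Apply an estimated mask to the mixture spectrogram and resynthesize
    the waveform using the mixture phase."""
    _, _, Y = stft(mixture, fs=fs, nperseg=nperseg, noverlap=noverlap)
    _, enhanced = istft(Y * mask, fs=fs, nperseg=nperseg, noverlap=noverlap)
    return enhanced
```

In a supervised setup of this kind, a network trained on spatial and spectral features would predict a mask of the same shape as the IRM, which `apply_mask` then uses to recover the target speech.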
