A new feature set for masking-based monaural speech separation

We propose a new feature based on a gammatone filter bank for improving monaural speech separation with neural networks. Like previous approaches, the feature encodes the local information of the cochleagram together with its spectrotemporal context; in addition, it captures the time-frequency dynamics of that context using an image processing technique. Separation is performed by estimating optimal time-frequency masks with two types of neural networks, a DNN and an LSTM, allowing us to examine how feature and model properties interact. The feature was evaluated in a range of simulated environments with different non-stationary noises and reverberation times, and performance was quantified with three objective measures. Experimental results show that the proposed monaural feature set improves objective speech intelligibility, speech quality, and signal-to-noise ratio over prior feature sets in noisy and reverberant environments, with the largest gains in speech intelligibility.
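To make the pipeline concrete, below is a minimal sketch of a gammatone-based feature of this kind: a cochleagram from an ERB-spaced gammatone filter bank, a stacked spectrotemporal context window, and simple time-frequency dynamics. This is not the authors' exact method; the channel count (64), frequency range (50 Hz to Nyquist), frame length (20 ms), context width, and the first-order difference operators standing in for the paper's image-processing step are all illustrative assumptions.

```python
import numpy as np
from scipy.signal import fftconvolve

def erb_space(low, high, n):
    """Center frequencies equally spaced on the ERB-rate scale (Slaney's formula)."""
    ear_q, min_bw = 9.26449, 24.7
    c = ear_q * min_bw
    return -c + np.exp(np.arange(1, n + 1) *
                       (np.log(low + c) - np.log(high + c)) / n) * (high + c)

def gammatone_ir(fc, fs, dur=0.04, order=4):
    """Impulse response of a 4th-order gammatone filter with 1-ERB bandwidth."""
    t = np.arange(int(dur * fs)) / fs
    erb = 24.7 * (4.37 * fc / 1000.0 + 1.0)
    b = 1.019 * erb
    return t ** (order - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t)

def cochleagram(x, fs, n_ch=64, frame_s=0.02):
    """Log-energy of each gammatone channel over non-overlapping frames."""
    hop = int(frame_s * fs)
    n_frames = len(x) // hop
    cg = np.empty((n_ch, n_frames))
    for i, fc in enumerate(erb_space(50.0, fs / 2.0, n_ch)):
        y = fftconvolve(x, gammatone_ir(fc, fs), mode="full")[:len(x)]
        e = (y[:n_frames * hop] ** 2).reshape(n_frames, hop).sum(axis=1)
        cg[i] = np.log(e + 1e-12)
    return cg  # shape: (channels, frames)

def features(cg, context=2):
    """Per-frame feature vector: local context window plus T-F dynamics.

    The context stacking follows common practice; the first-order time and
    frequency differences below merely stand in for the paper's
    image-processing step, whose details are not reproduced here.
    """
    n_ch, n_fr = cg.shape
    pad = np.pad(cg, ((0, 0), (context, context)), mode="edge")
    ctx = np.stack([pad[:, t:t + 2 * context + 1].ravel()
                    for t in range(n_fr)])      # (frames, ch * (2c + 1))
    d_t = np.gradient(cg, axis=1)               # temporal dynamics
    d_f = np.gradient(cg, axis=0)               # spectral dynamics
    return np.hstack([ctx, d_t.T, d_f.T])       # (frames, feature_dim)

if __name__ == "__main__":
    fs = 16000
    x = np.random.randn(fs)            # 1 s of noise as a stand-in signal
    F = features(cochleagram(x, fs))
    print(F.shape)                     # (50, 64*5 + 64 + 64) = (50, 448)
```

In a masking-based system such as the one described here, each row of the resulting feature matrix would be fed to a DNN or LSTM trained to estimate a time-frequency mask, which is then applied to the noisy cochleagram to resynthesize the target speech.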
