Combining Binaural and Cortical Features for Robust Speech Recognition

The segregation of concurrent speakers and other sound sources is an important ability of the human auditory system, but it is missing in most current systems for automatic speech recognition (ASR), resulting in a large gap between human and machine performance. This study combines processing related to peripheral and cortical stages of the auditory pathway: first, a physiologically motivated binaural model estimates the positions of moving speakers in order to enhance the desired speech signal; second, the enhanced signals are converted to spectro-temporal Gabor features, which resemble cortical speech representations and have been shown to improve ASR in noisy conditions. The Gabor features improve recognition results over Mel-frequency cepstral coefficients in all acoustic conditions under consideration. Binaural processing yields lower word error rates (WERs) in acoustic scenes with a concurrent speaker, whereas monaural processing is preferable in the presence of a stationary masking noise. An in-depth analysis of the binaural processing identifies crucial steps, such as the localization of sound sources and the estimation of the beamformer's noise coherence matrix, and shows how much each step affects recognition performance in acoustic conditions of varying complexity.
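The spectro-temporal Gabor features mentioned above are, in essence, localized 2D modulation filters applied to a time-frequency representation such as a log-mel spectrogram. The following is a minimal sketch of that idea; the filter sizes, modulation frequencies, and the Hann envelope are illustrative choices, not the paper's exact parameterization:

```python
import numpy as np

def gabor_filter_2d(omega_t, omega_f, size_t=11, size_f=11):
    """A 2D spectro-temporal Gabor filter: a complex sinusoid windowed by a
    Hann envelope, tuned to one temporal (omega_t, rad/frame) and one
    spectral (omega_f, rad/channel) modulation frequency."""
    t = np.arange(size_t) - size_t // 2
    f = np.arange(size_f) - size_f // 2
    envelope = np.outer(np.hanning(size_f), np.hanning(size_t))
    carrier = np.exp(1j * (omega_f * f[:, None] + omega_t * t[None, :]))
    g = envelope * carrier
    g -= g.mean()  # zero-mean: constant spectrogram regions give no response
    return g

def filter_spectrogram(log_mel, g):
    """'Same'-size 2D convolution of a (channels x frames) log-mel spectrogram
    with a Gabor filter, computed via FFT; the real part is the feature map."""
    H, W = log_mel.shape
    hh, hw = g.shape
    shape = (H + hh - 1, W + hw - 1)
    full = np.fft.ifft2(np.fft.fft2(log_mel, shape) * np.fft.fft2(g, shape))
    r0, c0 = hh // 2, hw // 2
    return np.real(full[r0:r0 + H, c0:c0 + W])
```

In a feature-extraction front end, a bank of such filters spanning several temporal and spectral modulation frequencies would be applied, and the resulting feature maps stacked (and typically dimension-reduced) before being passed to the recognizer.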
