Binaural sound source separation motivated by auditory processing

In this paper we present a new method of signal processing for robust speech recognition using two microphones. The method, loosely based on the human binaural hearing system, consists of passing the speech signals detected by the two microphones through a bank of bandpass filters. We develop a spatial masking function based on normalized cross-correlation, which provides rejection of off-axis interfering signals. To obtain further improvements in reverberant environments, we add a temporal masking component, which is closely related to our previously described de-reverberation technique known as SSF. We demonstrate that this approach provides substantially better recognition accuracy than conventional binaural sound-source separation algorithms.
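As a rough illustration of the spatial-masking idea, the Python sketch below builds a binary mask from a zero-lag normalized cross-correlation computed in each time-frequency cell and keeps only the cells dominated by the on-axis source. Everything here is an assumption for illustration: the STFT stands in for the auditory-motivated bandpass filter bank, and the correlation measure, `threshold` parameter, and binary-mask rule are simplifications, not the paper's actual masking function; the SSF-based temporal masking is not reproduced.

```python
import numpy as np
from scipy.signal import stft, istft

def spatial_mask_separation(left, right, fs, threshold=0.95, frame_len=512):
    """Hypothetical sketch: suppress off-axis energy using a
    normalized cross-correlation between the two channels in each
    time-frequency cell. Parameter names and the binary-mask rule
    are illustrative, not the authors' exact formulation."""
    # Short-time analysis of both channels (STFT as a stand-in for
    # the auditory-motivated bandpass filter bank).
    f, t, L = stft(left, fs, nperseg=frame_len)
    _, _, R = stft(right, fs, nperseg=frame_len)

    # Zero-lag normalized cross-correlation per T-F cell:
    # real(L * conj(R)) / (|L| |R|) equals the cosine of the
    # interchannel phase difference. An on-axis (zero-delay) source
    # keeps the channels in phase, so the value is near 1; off-axis
    # interference pulls it down.
    eps = 1e-12
    rho = np.real(L * np.conj(R)) / (np.abs(L) * np.abs(R) + eps)

    # Binary spatial mask: keep cells dominated by the on-axis source.
    mask = (rho > threshold).astype(float)

    # Resynthesize the target estimate from the masked average channel.
    _, target = istft(mask * 0.5 * (L + R), fs, nperseg=frame_len)
    return target
```

For a quick trial, one could pass two synchronized microphone channels sampled at the same rate, e.g. `y = spatial_mask_separation(left, right, fs=16000)`; lowering `threshold` admits more off-axis energy, trading interference rejection for fewer masking artifacts.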
