A Binaural Scene Analyzer for Joint Localization and Recognition of Speakers in the Presence of Interfering Noise Sources and Reverberation

In this study, we present a binaural scene analyzer that simultaneously localizes, detects, and identifies a known number of target speakers in the presence of spatially positioned noise sources and reverberation. In contrast to many other binaural cocktail party processors, the proposed system does not require a priori knowledge of the azimuth positions of the target speakers. The system consists of three main building blocks: binaural localization, speech source detection, and automatic speaker identification. First, a binaural front-end robustly localizes relevant sound source activity. Second, a speech detection module based on missing data classification determines whether each detected sound source corresponds to a speaker or to an interfering noise source, using a binary mask derived from spatial evidence supplied by the binaural front-end. Third, a second missing data classifier recognizes the identities of all detected speech sources. The proposed system is systematically evaluated in simulated adverse acoustic scenarios. Compared to state-of-the-art MFCC recognizers, the proposed model achieves significantly higher speaker recognition accuracy.
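To illustrate how a binary mask can enter a missing data classifier, the following is a minimal sketch, not necessarily the classifier used in this study. It assumes diagonal-covariance Gaussian mixture speaker models and plain marginalization, in which feature dimensions flagged as unreliable by the mask are simply left out of the likelihood. The function names, the `models` dictionary layout, and the toy dimensions are hypothetical choices made for illustration only.

```python
import numpy as np


def masked_gmm_loglik(features, mask, weights, means, variances):
    """Log-likelihood of features under a diagonal-covariance GMM,
    marginalizing out dimensions the binary mask marks as unreliable.

    features, mask: arrays of shape (T, D); mask is 1 (reliable) or 0.
    weights: shape (K,); means, variances: shape (K, D).
    """
    features = np.asarray(features, dtype=float)
    mask = np.asarray(mask, dtype=float)
    weights = np.asarray(weights, dtype=float)
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)

    # Per-dimension Gaussian log-densities, shape (T, K, D).
    diff = features[:, None, :] - means[None, :, :]
    log_pdf = -0.5 * (np.log(2.0 * np.pi * variances)[None, :, :]
                      + diff ** 2 / variances[None, :, :])

    # Marginalization: unreliable dimensions contribute log(1) = 0.
    log_pdf = log_pdf * mask[:, None, :]

    # Combine dimensions and mixture weights, shape (T, K).
    comp = np.log(weights)[None, :] + log_pdf.sum(axis=2)

    # Numerically stable log-sum-exp over mixture components.
    peak = comp.max(axis=1, keepdims=True)
    frame_ll = peak[:, 0] + np.log(np.exp(comp - peak).sum(axis=1))
    return float(frame_ll.sum())


def identify_speaker(features, mask, models):
    """Return the key of the model with the highest masked log-likelihood.

    models: dict mapping speaker name -> (weights, means, variances).
    """
    scores = {name: masked_gmm_loglik(features, mask, *params)
              for name, params in models.items()}
    return max(scores, key=scores.get)


# Toy example (assumed values): two single-component speaker models.
models = {
    "a": ([1.0], [[0.0, 0.0, 0.0, 0.0]], [[1.0, 1.0, 1.0, 1.0]]),
    "b": ([1.0], [[3.0, 3.0, 3.0, 3.0]], [[1.0, 1.0, 1.0, 1.0]]),
}
# Frames match speaker "a" in the first two channels; the last two
# channels are dominated by an interferer.
features = np.tile([0.0, 0.0, 10.0, 10.0], (5, 1))
spatial_mask = np.tile([1, 1, 0, 0], (5, 1))   # interferer channels masked
full_mask = np.ones_like(spatial_mask)         # no masking

masked_decision = identify_speaker(features, spatial_mask, models)
unmasked_decision = identify_speaker(features, full_mask, models)
```

In this toy setup, excluding the interferer-dominated channels lets the clean channels decide the identity, whereas scoring all channels lets the corrupted ones pull the decision toward the wrong model.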
