Paper information - A Binaural Scene Analyzer for Joint Localization and Recognition of Speakers in the Presence of Interfering Noise Sources and Reverberation

In this study, we present a binaural scene analyzer that can simultaneously localize, detect, and identify a known number of target speakers in the presence of spatially positioned noise sources and reverberation. In contrast to many other binaural cocktail party processors, the proposed system does not require a priori knowledge of the azimuth positions of the target speakers. The system consists of three main building blocks: binaural localization, speech source detection, and automatic speaker identification. First, a binaural front-end robustly localizes relevant sound source activity. Second, a speech detection module based on missing data classification determines whether each detected source corresponds to a speaker or to an interfering noise source, using a binary mask derived from the spatial evidence supplied by the binaural front-end. Third, a second missing data classifier recognizes the identities of all detected speech sources. The proposed system is systematically evaluated in simulated adverse acoustic scenarios. Compared to state-of-the-art MFCC recognizers, the proposed model achieves significant improvements in speaker recognition accuracy.
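The abstract does not specify how the missing data classifiers are built, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes each speaker is modeled by a diagonal-covariance Gaussian mixture and that time-frequency units marked unreliable by the binary mask are simply marginalized out (skipped) when scoring. The names `masked_gmm_loglik` and `identify_speaker`, the feature layout, and the toy models in the usage note are all hypothetical.

```python
import math


def masked_gmm_loglik(frames, masks, weights, means, variances):
    """Log-likelihood of a feature sequence under a diagonal GMM,
    using only the feature dimensions the binary mask marks as reliable.

    frames, masks: lists of per-frame lists (same length per frame).
    weights: mixture weights; means, variances: one list per component.
    Assumption: unreliable dimensions are marginalized, i.e. they
    contribute nothing to the score.
    """
    total = 0.0
    for x, m in zip(frames, masks):
        comp_logs = []
        for w, mu, var in zip(weights, means, variances):
            s = math.log(w)
            for xd, md, mud, vd in zip(x, m, mu, var):
                if md:  # reliable unit: add its Gaussian log-density
                    s += -0.5 * (math.log(2 * math.pi * vd)
                                 + (xd - mud) ** 2 / vd)
            comp_logs.append(s)
        # Numerically stable log-sum-exp over mixture components.
        peak = max(comp_logs)
        total += peak + math.log(sum(math.exp(c - peak) for c in comp_logs))
    return total


def identify_speaker(frames, masks, models):
    """Return the name of the speaker model with the highest masked score.

    models: dict mapping speaker name -> (weights, means, variances).
    """
    scores = {name: masked_gmm_loglik(frames, masks, *gmm)
              for name, gmm in models.items()}
    return max(scores, key=scores.get)
```

As a usage illustration, take two toy single-component models, "A" with all means at 0 and "B" with all means at 3 (unit variances), and frames `[0, 0, 6, 6]` in which the last two dimensions are dominated by an interferer. With the mask `[1, 1, 0, 0]` the sketch selects "A"; with every unit treated as reliable, the corrupted dimensions pull the decision to "B". This is the effect a spatially derived mask is meant to prevent.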
