A Classification-based Cocktail-party Processor

At a cocktail party, a listener can selectively attend to a single voice and filter out other acoustic interference. How to simulate this perceptual ability remains a great challenge. This paper describes a novel supervised learning approach to speech segregation, in which a target speech signal is separated from interfering sounds using spatial location cues: interaural time differences (ITD) and interaural intensity differences (IID). Motivated by the auditory masking effect, we employ the notion of an ideal time-frequency binary mask, which selects the target if it is stronger than the interference in a local time-frequency unit. Within a narrow frequency band, modifications to the relative strength of the target source with respect to the interference trigger systematic changes in the estimated ITD and IID. For a given spatial configuration, this interaction produces characteristic clustering in the binaural feature space. Consequently, we perform pattern classification in order to estimate ideal binary masks. A systematic evaluation in terms of signal-to-noise ratio as well as automatic speech recognition performance shows that the resulting system produces masks very close to ideal binary ones. A quantitative comparison shows that our model yields significant improvement in performance over an existing approach. Furthermore, under certain conditions the model produces large speech intelligibility improvements with normal listeners.
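The ideal binary mask described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes access to the premixed target and interference time-frequency representations (e.g., gammatone filterbank or STFT magnitudes), and it uses a 0 dB local criterion as the comparison threshold; the function name and parameters are hypothetical.

```python
import numpy as np

def ideal_binary_mask(target_tf, interference_tf, lc_db=0.0):
    """Ideal binary mask: 1 in each time-frequency unit where the target
    energy exceeds the interference energy by the local criterion lc_db
    (in dB), 0 elsewhere. Inputs are same-shaped arrays of T-F magnitudes.
    """
    # Convert magnitudes to energy in dB; the small constant avoids log(0).
    target_db = 10.0 * np.log10(np.abs(target_tf) ** 2 + 1e-12)
    interf_db = 10.0 * np.log10(np.abs(interference_tf) ** 2 + 1e-12)
    # Keep a unit when the target dominates the interference.
    return (target_db - interf_db > lc_db).astype(np.uint8)
```

Applying such a mask to the mixture retains the target-dominant units and discards the rest; the paper's classifier estimates this mask from the ITD/IID features alone, without access to the premixed signals.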
