Speech Separation Based on the Statistics of Binaural Auditory Features

A computational auditory scene analysis (CASA) system is described in which sound separation according to spatial location is combined with the 'missing data' approach to automatic speech recognition. Time-frequency masks for the missing data recognizer are derived from the statistics of interaural time and level differences; these masks identify acoustic features that constitute reliable evidence of the target speech signal. It is demonstrated that this approach yields good performance in a challenging environment, in which a target voice is contaminated by another talker and by reverberation. The ability of the system to generalize to source-receiver configurations that were not encountered during training is also discussed.
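The mask-estimation idea described above can be illustrated with a minimal sketch. The snippet below is not the paper's method; it is a simplified, hypothetical illustration in which per-bin interaural time differences (from the interaural phase difference) and interaural level differences are compared against the target's expected values (here assumed to be zero, i.e. a frontal source), and time-frequency cells whose cues fall within a tolerance are marked as reliable. All parameter names and tolerances (`itd_tol`, `ild_tol`) are illustrative assumptions.

```python
import numpy as np

def estimate_mask(left, right, fs, frame_len=256,
                  itd_ref=0.0, ild_ref=0.0,
                  itd_tol=1e-4, ild_tol=3.0):
    """Hypothetical sketch of binaural mask estimation.

    Cells whose interaural cues match the target's expected
    ITD (seconds) and ILD (dB) within a tolerance are marked
    reliable (mask = 1); all other cells are treated as missing.
    """
    n_frames = len(left) // frame_len
    n_bins = frame_len // 2 + 1
    mask = np.zeros((n_frames, n_bins))
    win = np.hanning(frame_len)
    freqs = np.fft.rfftfreq(frame_len, 1.0 / fs)
    for t in range(n_frames):
        seg = slice(t * frame_len, (t + 1) * frame_len)
        l = np.fft.rfft(win * left[seg])
        r = np.fft.rfft(win * right[seg])
        # Interaural phase difference -> ITD per bin
        # (ambiguous above ~1.5 kHz; ignored in this sketch).
        itd = np.angle(l * np.conj(r)) / (2 * np.pi * np.maximum(freqs, 1e-9))
        # Interaural level difference in dB per bin.
        ild = 20 * np.log10((np.abs(l) + 1e-12) / (np.abs(r) + 1e-12))
        mask[t] = ((np.abs(itd - itd_ref) < itd_tol) &
                   (np.abs(ild - ild_ref) < ild_tol)).astype(float)
    return mask
```

In a missing data recognizer, the resulting mask would flag which spectro-temporal features are treated as reliable evidence of the target; the paper derives such masks from learned statistics of the binaural cues rather than from the fixed thresholds used here.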
