Incorporating localisation cues in a fragment decoding framework for distant binaural speech recognition

This paper addresses the problem of speech recognition using distant microphones in reverberant multisource noise conditions. Specifically, the experiments employ recordings of a noisy domestic living room made using a pair of microphones in a binaural configuration, to which target speech has been added after convolution with binaural room impulse responses. Our scheme employs two stages: first, spectro-temporal acoustic source fragments are located using signal-level cues; second, a top-down hypothesis-driven stage simultaneously searches for the most probable allocation of fragments to target or masker and the corresponding acoustic model state sequence. The paper reports a first attempt to use binaural localisation cues within this framework. Our initial experiments with localisation cues have not improved on the baseline performance, which uses single-channel source separation cues alone. The paper discusses potential reasons for the lack of improvement and suggests fresh ideas that may prove more successful.
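As an illustration of what a binaural localisation cue can look like, the sketch below estimates an interaural time difference (ITD) as the lag that maximises the cross-correlation between the left and right channels. This is a generic textbook technique and is not necessarily the cue extraction used in the paper; the function name `estimate_itd_samples`, the `max_lag` parameter, and the synthetic white-noise signal are all assumptions made for the example.

```python
import random


def estimate_itd_samples(left, right, max_lag):
    """Return the lag (in samples) that maximises the cross-correlation.

    A positive lag means the right channel lags behind the left channel.
    Hypothetical helper for illustration only.
    """
    n = len(left)
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i in range(n):
            j = i + lag
            if 0 <= j < n:
                score += left[i] * right[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


# Synthetic test signal: white noise reaching the right ear 5 samples late.
random.seed(0)
source = [random.gauss(0.0, 1.0) for _ in range(400)]
delay = 5
left = source
right = [0.0] * delay + source[:-delay]

estimated = estimate_itd_samples(left, right, max_lag=16)
print("estimated lag in samples:", estimated)
```

In a fragment-based system, a cue of this kind would be computed per spectro-temporal region rather than over the whole signal, and reverberation would make the correlation peak far less distinct than in this clean synthetic case.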
