Mask estimation based on sound localisation for missing data speech recognition

This paper describes a perceptually motivated computational auditory scene analysis (CASA) system that combines sound separation according to spatial location with 'missing data' techniques for robust speech recognition in noise. Missing data time-frequency masks are produced using cross-correlation to estimate interaural time difference (ITD) and hence spatial azimuth; this is used to determine which time-frequency regions constitute reliable evidence of the target speech. Three experiments compare the effects of different reverberation surfaces, localisation methods and azimuth separations on recognition accuracy, together with the effects of two post-processing techniques (morphological operations and supervised learning) for improving mask estimation. Both post-processing techniques greatly improve recognition accuracy, with the learned mapping giving the best results.
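To make the mask estimation step concrete, the sketch below computes a frame-level reliability decision for a single frequency channel of a binaural signal pair: the peak of the cross-correlation between the two ear signals gives the ITD, which is converted to azimuth and compared against the target direction. This is a minimal illustration, not the paper's implementation; the free-field two-microphone ITD-to-azimuth conversion, the ear spacing, the tolerance and all parameter defaults are assumptions. Applying the function to each channel of an auditory filterbank yields the time-frequency mask.

```python
import numpy as np

def itd_reliability_mask(left, right, fs, n_lags=16, frame_len=400, hop=200,
                         target_azimuth=0.0, tol=5.0, ear_spacing=0.18, c=343.0):
    """Binary reliability mask for one frequency channel, from ITD.

    left, right : signals for one frequency channel at the two ears.
    For each frame, the cross-correlation is searched over a small lag
    range; the peak lag gives the ITD, converted to azimuth with a
    free-field approximation itd = (d/c)*sin(azimuth) (an assumed model;
    the paper's mapping may differ). Frames whose azimuth lies within
    `tol` degrees of the target are marked reliable (1), else 0.
    """
    n_frames = 1 + (len(left) - frame_len) // hop
    mask = np.zeros(n_frames, dtype=int)
    lags = np.arange(-n_lags, n_lags + 1)
    for m in range(n_frames):
        l = left[m * hop : m * hop + frame_len]
        r = right[m * hop : m * hop + frame_len]
        # Cross-correlation over the physically plausible lag range:
        # for lag k, sum_t l[t] * r[t + k].
        xc = [np.dot(l[max(0, -k):frame_len - max(0, k)],
                     r[max(0, k):frame_len - max(0, -k)]) for k in lags]
        itd = lags[int(np.argmax(xc))] / fs  # seconds; sign convention is arbitrary
        # Invert the assumed ITD model to get azimuth in degrees.
        sin_az = np.clip(itd * c / ear_spacing, -1.0, 1.0)
        azimuth = np.degrees(np.arcsin(sin_az))
        mask[m] = int(abs(azimuth - target_azimuth) <= tol)
    return mask
```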
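The morphological post-processing of the estimated mask can likewise be sketched as a binary opening (removing isolated, likely spurious 'reliable' points) followed by a binary closing (filling small holes in otherwise reliable regions). The 3x3 structuring element and the use of scipy.ndimage are assumptions for illustration; the paper does not specify them here.

```python
import numpy as np
from scipy.ndimage import binary_opening, binary_closing

def clean_mask(mask):
    """Morphological clean-up of a binary time-frequency mask.

    mask : 2-D array (channels x frames) of 0/1 reliability decisions.
    Opening then closing with a 3x3 structuring element (an assumed
    choice) smooths the mask before missing-data recognition.
    """
    structure = np.ones((3, 3), dtype=bool)
    opened = binary_opening(mask.astype(bool), structure=structure)
    closed = binary_closing(opened, structure=structure)
    return closed.astype(int)
```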