Location-based sound segregation

At a cocktail party, we can selectively attend to a single voice and filter out all other acoustic interference. How to simulate this perceptual ability remains a great challenge. This paper describes a novel location-based approach to speech segregation. The auditory masking effect motivates the notion of an “ideal” time-frequency binary mask, which selects the target if it is stronger than the interference in a local time-frequency region. We observe that, within a narrow frequency band, changes in the energy of the target source relative to the interference trigger systematic deviations in the binaural cues. For a given spatial configuration, this interaction produces characteristic clustering in the binaural feature space. Consequently, we perform pattern classification in order to estimate ideal binary masks. A systematic evaluation shows that the resulting system produces masks very close to ideal binary ones and yields a large improvement over previous models.
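
To make the notion of an ideal binary mask concrete, the following is a minimal sketch, not the system evaluated in the paper. It assumes the target and interference energies are available separately for each time-frequency unit, and that a unit is kept when the target energy strictly exceeds the interference energy. The function name `ideal_binary_mask` and the example values are hypothetical and chosen for illustration only.

```python
from typing import List


def ideal_binary_mask(
    target: List[List[float]],
    interference: List[List[float]],
) -> List[List[int]]:
    """Return 1 for each time-frequency unit where the target is stronger.

    Both inputs are hypothetical energy grids indexed as
    [frequency_channel][time_frame]; they must have the same shape.
    """
    if len(target) != len(interference):
        raise ValueError("energy grids must have the same number of channels")
    mask = []
    for t_row, i_row in zip(target, interference):
        if len(t_row) != len(i_row):
            raise ValueError("energy grids must have the same number of frames")
        # Assumed criterion: keep the unit only if the target energy
        # strictly exceeds the interference energy.
        mask.append([1 if t > i else 0 for t, i in zip(t_row, i_row)])
    return mask


# Illustrative values only: 2 frequency channels by 3 time frames.
target_energy = [[4.0, 0.5, 2.0], [0.1, 3.0, 1.0]]
interference_energy = [[1.0, 2.0, 2.0], [0.2, 0.5, 4.0]]
mask = ideal_binary_mask(target_energy, interference_energy)
```

In this sketch a tie is assigned to the interference. The approach described above does not have access to the separate energies and instead estimates such a mask from binaural cues by pattern classification.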