Using auditory saliency to understand complex auditory scenes

In this paper, we present a computational model for predicting pre-attentive, bottom-up auditory saliency. The model predicts which parts of a complex auditory scene perceptually stand out to listeners and can therefore be used to identify the most important parts of that scene. It is the auditory analogue of the visual saliency models introduced by Koch and colleagues [1]. The model is based on inhibition of features obtained from auditory spectro-temporal receptive fields (STRFs), and its predictions agree well with preliminary psychoacoustic experiments: it correctly predicts what is salient in several common auditory examples, and there is a strong correlation between the scenes the model selects as salient and those that human subjects selected as salient.
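For concreteness, the sketch below illustrates one plausible reading of such a pipeline, not the paper's actual method: STRF-like features are approximated by 2-D Gabor filters applied to a log-magnitude spectrogram, and "inhibition" is approximated by center-surround normalization of the resulting feature maps. All filter shapes, parameters, and the inhibition rule here are illustrative assumptions; the paper's true STRF features and inhibition scheme are defined in the body of the text.

```python
# A minimal sketch of a bottom-up auditory saliency map, assuming
# (hypothetically) Gabor filters as STRF stand-ins and center-surround
# subtraction as the inhibition step. Parameters are illustrative only.
import numpy as np
from scipy.signal import spectrogram, fftconvolve
from scipy.ndimage import gaussian_filter

def gabor_kernel(theta, freq=0.15, size=15, sigma=3.0):
    """2-D Gabor patch: a crude stand-in for one spectro-temporal RF."""
    ax = np.arange(size) - size // 2
    x, y = np.meshgrid(ax, ax)
    xr = x * np.cos(theta) + y * np.sin(theta)
    return np.exp(-(x**2 + y**2) / (2 * sigma**2)) * np.cos(2 * np.pi * freq * xr)

def saliency_map(audio, fs, n_orientations=4):
    """STRF-like feature extraction followed by center-surround inhibition."""
    _, _, sxx = spectrogram(audio, fs, nperseg=512, noverlap=384)
    logspec = np.log1p(sxx)  # compressive nonlinearity, common in auditory models
    maps = []
    for k in range(n_orientations):
        theta = np.pi * k / n_orientations
        feat = np.abs(fftconvolve(logspec, gabor_kernel(theta), mode="same"))
        # Inhibition: local excitation minus a broad inhibitory surround.
        cs = gaussian_filter(feat, 1.0) - gaussian_filter(feat, 8.0)
        maps.append(np.maximum(cs, 0.0))
    sal = sum(m / (m.max() + 1e-12) for m in maps)  # normalize and combine
    return sal / (sal.max() + 1e-12)

if __name__ == "__main__":
    fs = 16000
    t = np.arange(fs) / fs
    audio = 0.1 * np.random.randn(fs)  # background noise
    audio[fs // 2 : fs // 2 + 800] += np.sin(2 * np.pi * 1000 * t[:800])  # tone burst
    print(saliency_map(audio, fs).shape)  # time-frequency saliency map
```

In this toy usage, a brief tone burst embedded in broadband noise yields high values in the saliency map at the burst's time-frequency location, which is the qualitative behavior a bottom-up saliency model should exhibit.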