Automatic detection of auditory salience with optimized linear filters derived from human annotation

Auditory salience describes how much a particular auditory event attracts human attention. Previous attempts at automatic detection of salient audio events have been hampered by the challenge of defining ground truth. In this paper, ground truth for auditory salience is built from human subjects' annotations of a large corpus of meeting room recordings. Following statistical purification of the annotations, an optimal auditory salience filter is derived from the purified data by linear discrimination. An automatic auditory salience detector based on optimal filtering of the Bark-frequency loudness achieves a 32% equal error rate. Expanding the feature vector to include other common feature sets does not improve performance. Consistent with intuition, the optimal filter looks like an onset detector in the time domain.
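To make the idea of a discriminatively derived linear filter and its equal error rate concrete, the following is a minimal sketch on synthetic data. It is not the paper's procedure: the Fisher linear discriminant stands in for the unspecified "linear discrimination" step, a single loudness band replaces the Bark-frequency loudness features, and the window length `n_frames`, the `ridge` term, and the function names are all hypothetical choices made for illustration.

```python
import numpy as np

def fisher_filter(X_pos, X_neg, ridge=1e-3):
    """Fisher linear discriminant weights: w = Sw^{-1} (mu_pos - mu_neg).

    Each row of X_pos / X_neg is a short window of loudness frames.
    The ridge term is an assumption added for numerical stability.
    """
    mu_p, mu_n = X_pos.mean(axis=0), X_neg.mean(axis=0)
    Sw = np.cov(X_pos, rowvar=False) + np.cov(X_neg, rowvar=False)
    Sw += ridge * np.eye(Sw.shape[0])
    return np.linalg.solve(Sw, mu_p - mu_n)

def equal_error_rate(scores_pos, scores_neg):
    """Approximate EER: the operating point where miss rate and
    false-alarm rate are closest, scanning every observed score
    as a candidate threshold."""
    thresholds = np.sort(np.concatenate([scores_pos, scores_neg]))
    miss = np.array([(scores_pos < t).mean() for t in thresholds])
    fa = np.array([(scores_neg >= t).mean() for t in thresholds])
    i = np.argmin(np.abs(miss - fa))
    return 0.5 * (miss[i] + fa[i])

# Synthetic stand-in data (not from the paper): non-salient windows are
# pure noise; salient windows add a unit loudness step halfway through.
rng = np.random.default_rng(0)
n_frames = 8  # hypothetical context window length
onset = np.concatenate([np.zeros(n_frames // 2), np.ones(n_frames // 2)])
X_neg = rng.normal(size=(2000, n_frames))
X_pos = rng.normal(size=(2000, n_frames)) + onset

w = fisher_filter(X_pos, X_neg)

# Scored on the training windows for brevity; a real evaluation
# would use held-out data.
eer = equal_error_rate(X_pos @ w, X_neg @ w)
print("filter weights:", np.round(w, 2))
print("equal error rate:", round(float(eer), 3))
```

In this toy setting the learned weights are near zero before the step and positive after it, so the filter responds to a rise in loudness. The abstract's remark that the optimal filter resembles an onset detector describes the same kind of behavior, although the paper's actual features and data differ.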
