The Robustness of Echoic Log-Surprise Auditory Saliency Detection

The concept of saliency describes how relevant a stimulus is to humans. It has been studied from different perspectives and in different modalities, such as audio, visual, or both. Intelligent systems have employed it to interact with their environment in an attempt to emulate or even outperform human behavior in tasks such as surveillance, alarm systems, and robotics. In this paper, we focus on the aural modality. Our goal is to measure the robustness of Echoic log-surprise against a set of auditory saliency techniques on the task of saliency detection in noisy environments. The acoustic saliency methods analyzed are Kalinli’s saliency model, Bayesian log-surprise, and our proposed algorithm, Echoic log-surprise. The latter combines an unsupervised approach based on Bayesian log-surprise with the biological concept of echoic (auditory sensory) memory by means of a statistical fusion scheme, for which we consider several distance metrics and statistical divergences, such as Rényi’s and Jensen-Shannon’s. For comparison purposes, we also evaluate classical onset detection techniques, such as those based on voice activity detection or energy thresholding. Results show that Echoic log-surprise outperforms the other techniques analyzed in this paper across a wide variety of noise types and signal-to-noise ratios, corroborating its robustness in noisy environments. In particular, our algorithm with the Jensen-Shannon fusion scheme produces the best F-scores. To better understand the behavior of Echoic log-surprise, we also study how its control parameters, depth and memory, affect its performance at different noise levels.
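
As a point of reference for the Jensen-Shannon divergence named above, the following minimal sketch computes it for two discrete probability distributions. It illustrates only the divergence itself, not the fusion scheme of Echoic log-surprise, whose details are not given here; the function names `kl_divergence` and `js_divergence` and the example distributions are assumptions made for illustration.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) in bits for discrete distributions.

    Terms with p_i == 0 contribute nothing, by the usual convention.
    """
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence between p and q (log base 2).

    It averages the KL divergences of p and q to their midpoint m, which
    makes it symmetric and keeps it finite even where p or q is zero.
    """
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical example distributions over two outcomes.
p = [0.5, 0.5]
q = [1.0, 0.0]

print(js_divergence(p, p))  # identical distributions
print(js_divergence(p, q))  # partially overlapping distributions
```

Unlike the Kullback-Leibler divergence, this quantity is symmetric in its arguments and bounded (by 1 when base-2 logarithms are used), which is one reason it is convenient for comparing distributions.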

[1]  R. A. Leibler,et al.  On Information and Sufficiency , 1951 .

[2]  Dong Yu,et al.  Recent progresses in deep learning based acoustic models , 2017, IEEE/CAA Journal of Automatica Sinica.

[3]  Annamaria Mesaros,et al.  Metrics for Polyphonic Sound Event Detection , 2016 .

[4]  Yifan Gong,et al.  An Overview of Noise-Robust Automatic Speech Recognition , 2014, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[5]  Jon Barker,et al.  An analysis of environment, microphone and data simulation mismatches in robust speech recognition , 2017, Comput. Speech Lang..

[6]  R. Hari,et al.  The Human Auditory Sensory Memory Trace Persists about 10 sec: Neuromagnetic Evidence , 1993, Journal of Cognitive Neuroscience.

[7]  Wonyong Sung,et al.  A statistical model-based voice activity detection , 1999, IEEE Signal Processing Letters.

[8]  Tobias Watzka,et al.  Proceedings of the Detection and Classification of Acoustic Scenes and Events 2018 Workshop (DCASE2018) , 2018 .

[9]  Peter Harremoës,et al.  Rényi Divergence and Kullback-Leibler Divergence , 2012, IEEE Transactions on Information Theory.

[10]  Dominik Endres,et al.  A new metric for probability distributions , 2003, IEEE Transactions on Information Theory.

[11]  W. A. Mvnso,et al.  Loudness , Its Definition , Measurement and Calculation , 2004 .

[12]  Michael T. Lippert,et al.  Mechanisms for Allocating Auditory Attention: An Auditory Saliency Map , 2005, Current Biology.

[13]  Richard M. Stern,et al.  Feature extraction for robust speech recognition based on maximizing the sharpness of the power distribution and on power flooring , 2010, 2010 IEEE International Conference on Acoustics, Speech and Signal Processing.

[14]  Josh H McDermott,et al.  Recovering sound sources from embedded repetition , 2011, Proceedings of the National Academy of Sciences.

[15]  W. von Suchodoletz,et al.  Auditory sensory memory and language abilities in former late talkers: a mismatch negativity study. , 2010, Psychophysiology.

[16]  Christof Koch,et al.  A Model of Saliency-Based Visual Attention for Rapid Scene Analysis , 2009 .

[17]  Andrey Temko,et al.  CLEAR Evaluation of Acoustic Event Detection and Classification Systems , 2006, CLEAR.

[18]  Shrikanth S. Narayanan,et al.  Prominence Detection Using Auditory Attention Cues and Task-Dependent High Level Information , 2009, IEEE Transactions on Audio, Speech, and Language Processing.

[19]  Carmen Peláez-Moreno,et al.  Echoic log-surprise: A multi-scale scheme for acoustic saliency detection , 2018, Expert Syst. Appl..

[20]  N. Cowan On short and long auditory stores. , 1984, Psychological bulletin.

[21]  Rainer Stiefelhagen,et al.  “Wow!” Bayesian surprise for salient acoustic event detection , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[22]  Bhuvana Ramabhadran,et al.  Invariant Representations for Noisy Speech Recognition , 2016, ArXiv.

[23]  Hynek Hermansky,et al.  RASTA processing of speech , 1994, IEEE Trans. Speech Audio Process..

[24]  H G Vaughan,et al.  Electrophysiological evidence of developmental changes in the duration of auditory sensory memory. , 1999, Developmental psychology.

[25]  T. Andringa,et al.  DARES-G 1 : Database of Annotated Real-world Everyday Sounds , 2009 .

[26]  Maria Chait,et al.  The effect of distraction on change detection in crowded acoustic scenes , 2016, Hearing Research.

[27]  Alex Acero,et al.  Spoken Language Processing , 2001 .

[28]  E. Schröger Mismatch Negativity: A Microphone into Auditory Memory , 2007 .

[29]  David Pearce,et al.  The aurora experimental framework for the performance evaluation of speech recognition systems under noisy conditions , 2000, INTERSPEECH.

[30]  Karl J. Friston,et al.  Is predictability salient? A study of attentional capture by auditory patterns , 2017, Philosophical Transactions of the Royal Society B: Biological Sciences.

[31]  A.V. Oppenheim,et al.  Enhancement and bandwidth compression of noisy speech , 1979, Proceedings of the IEEE.

[32]  Herman J. M. Steeneken,et al.  Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems , 1993, Speech Commun..

[33]  P. Ullsperger,et al.  Mismatch negativity in event-related potentials to auditory stimuli as a function of varying interstimulus interval. , 1992, Psychophysiology.

[34]  Yusuke Shinohara,et al.  Adversarial Multi-Task Learning of Deep Neural Networks for Robust Speech Recognition , 2016, INTERSPEECH.

[35]  S. Boll,et al.  Suppression of acoustic noise in speech using spectral subtraction , 1979 .

[36]  W. Suchodoletz,et al.  Development of auditory sensory memory from 2 to 6 years: an MMN study , 2008, Journal of Neural Transmission.