Echoic log-surprise: A multi-scale scheme for acoustic saliency detection

Abstract Perceptual signals such as acoustic or visual cues carry a massive amount of information. Humans cope with this overload through cognitive mechanisms related to attention. In particular, saliency is a property of certain stimuli that makes them stand out from others, allowing the brain to make decisions about their relevance while exploring the world. It is advantageous for artificial intelligence systems to mimic these mechanisms. Visual saliency algorithms have been successfully employed in tasks such as medical diagnosis, detection of violent scenes, and environment understanding by robots. In contrast, computational models of acoustic saliency are far less common. In this context, we propose a novel acoustic saliency algorithm for intelligent and expert systems facing tasks such as sound detection and classification, early alarm, surveillance, and robotic exploration of the surroundings, among many other applications. This technique, which we term echoic log-surprise, combines an unsupervised statistical approach based on Bayesian log-surprise with the biological concept of echoic or Auditory Sensory Memory. Our algorithm computes several independent log-surprise cues in parallel over a wide range of memory values, with the aim of leveraging saliency information from different temporal scales. We then explore several statistical metrics for combining these multi-scale signals into a single temporal saliency signal, including the Rényi entropy, the Jensen-Shannon divergence, and the Cramér and Bhattacharyya distances. We adopt Acoustic Event Detection tasks as proxies to evaluate its performance.
Results show that the proposed echoic log-surprise method outperforms classical acoustic detection techniques commonly deployed in intelligent and expert systems, such as energy thresholding or voice activity detection, and it also achieves better results than some other state-of-the-art acoustic saliency algorithms, such as Kalinli’s and conventional log-surprise.
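The pipeline described in the abstract (parallel log-surprise cues at several memory scales, fused into one saliency signal by a statistical divergence) can be sketched as follows. This is a hypothetical simplification, not the paper's implementation: each feature is modelled by a running Gaussian whose forgetting factor plays the role of echoic memory, per-frame Bayesian surprise is the KL divergence between consecutive Gaussian beliefs, and the multi-scale signals are fused with a windowed Jensen-Shannon divergence. All function names, window sizes, and memory values are illustrative assumptions.

```python
import numpy as np

def log_surprise(frames, memory):
    """Per-frame Bayesian log-surprise for one memory scale.

    `memory` (0 < memory < 1) is an exponential forgetting factor that
    stands in for the echoic-memory span: values near 1 remember longer.
    Each feature dimension is tracked by a running Gaussian; surprise is
    the KL divergence from the old belief to the updated one (a
    simplified stand-in for the paper's Bayesian surprise model).
    """
    frames = np.asarray(frames, dtype=float)
    mu, var = frames[0].copy(), np.ones_like(frames[0])
    out = np.empty(len(frames))
    for t, x in enumerate(frames):
        mu_new = memory * mu + (1 - memory) * x
        var_new = memory * var + (1 - memory) * (x - mu_new) ** 2 + 1e-8
        # KL( N(mu, var) || N(mu_new, var_new) ), summed over dimensions
        kl = 0.5 * (np.log(var_new / var)
                    + (var + (mu - mu_new) ** 2) / var_new - 1)
        out[t] = np.log1p(max(kl.sum(), 0.0))
        mu, var = mu_new, var_new
    return out

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two non-negative histograms."""
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log((a + eps) / (b + eps)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def echoic_log_surprise(frames, memories=(0.8, 0.95, 0.99), win=16, bins=8):
    """Fuse multi-scale log-surprise cues into one saliency signal.

    For each time step, recent surprise values at every scale are binned
    into histograms, and the mean pairwise JS divergence across scales
    is taken as the fused saliency (a loose sketch of the fusion idea).
    """
    sigs = [log_surprise(frames, m) for m in memories]
    T = len(frames)
    fused = np.zeros(T)
    for t in range(win, T):
        hists = [np.histogram(s[t - win:t], bins=bins,
                              range=(0.0, s.max() + 1e-6))[0].astype(float) + 1
                 for s in sigs]  # +1 = Laplace smoothing
        pairs = [js_divergence(hists[i], hists[j])
                 for i in range(len(hists)) for j in range(i + 1, len(hists))]
        fused[t] = np.mean(pairs)
    return fused
```

Under this reading, an abrupt acoustic event perturbs the short-memory Gaussian much faster than the long-memory one, so the per-scale surprise distributions diverge and the fused signal rises; swapping `js_divergence` for a Rényi-, Cramér-, or Bhattacharyya-based metric changes only the fusion step.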
