Saliency-maximized audio visualization and efficient audio-visual browsing for faster-than-real-time human acoustic event detection

Browsing large audio archives is challenging because of the limitations of human audition and attention. However, this task becomes easier with a suitable visualization of the audio signal, such as a spectrogram transformed to make unusual audio events salient. This transformation maximizes the mutual information between an isolated event's spectrogram and an estimate of how salient the event appears in its surrounding context. When such spectrograms are computed and displayed with fluid zooming over many temporal orders of magnitude, sparse events in long audio recordings can be detected more quickly and more easily. In particular, in a 1/10-real-time acoustic event detection task, subjects who were shown saliency-maximized rather than conventional spectrograms performed significantly better. Saliency maximization also improves the mutual information between the ground truth of nonbackground sounds and visual saliency, more than other common enhancements to visualization.
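The abstract evaluates visualizations by the mutual information between ground-truth nonbackground sound and visual saliency. As a rough illustration of that quantity only, the sketch below computes a plug-in estimate of mutual information, in bits, from paired discrete labels. The function name `mutual_information`, the per-frame binary labels, and the sample data are all assumptions made for this example. This is not the authors' saliency model or their optimization procedure.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y) in bits from paired discrete samples."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    if n == 0:
        raise ValueError("need at least one sample")

    joint = Counter(zip(xs, ys))  # counts of each (x, y) pair
    px = Counter(xs)              # marginal counts of x
    py = Counter(ys)              # marginal counts of y

    mi = 0.0
    for (x, y), c in joint.items():
        # p(x,y) * log2( p(x,y) / (p(x) * p(y)) ), written with raw counts
        mi += (c / n) * math.log2(c * n / (px[x] * py[y]))
    return mi


# Hypothetical per-frame labels: 1 = nonbackground event / visually salient.
event_present = [0, 0, 0, 1, 1, 0, 0, 1, 0, 0]
looks_salient = [0, 0, 1, 1, 1, 0, 0, 0, 0, 0]

score = mutual_information(event_present, looks_salient)
```

A higher `score` means that knowing whether a frame looks salient tells the viewer more about whether an event is actually present. A display whose salient regions are unrelated to the events would score near zero.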