FODAVA-Partner: Visualizing Audio for Anomaly Detection

“Most people who handle money a lot (i.e. cashiers) can identify a lower-quality fake bill instantly just by touching it” [10]. Data analysts are like cashiers: a trained data analyst can detect anomalies “at a glance” when data is appropriately transformed. This is the goal of data visualization. This proposal addresses the type of audio anomalies that human data analysts hear instantly: angry shouting, trucks at midnight on a residential street, gunshots. The human ear detects anomalies of this type rapidly and with high accuracy. For example, rifle magazine insertion clicks are detected with 100% accuracy at 0 dB SNR in white noise, babble, or jungle noise [1]. Unfortunately, a data analyst can listen to only one sound at a time. Visualization shows the analyst many sounds at once, possibly allowing him or her to detect an anomaly several orders of magnitude faster than “real time.” This proposal aims to render large audio data sets, comprising thousands of microphones or thousands of minutes, as interactive graphics that reveal important anomalies at a glance.

Precedents for such graphical rendering are familiar to audio professionals. A simple amplitude-versus-time graph reveals silences for a speech transcriber to skip past; a spectrogram reveals details of birdsong to an ornithologist. But many audio anomalies are not so easy to display: angry shouts versus enthusiastically shouted greetings; the clatter of overturned tables rather than mere dishwashing; spoken Thai in Berlin, or German in Bangkok. All these cases are amenable to automatic anomaly detection, using probabilistic models of long- and short-term spectral features of normal activity. Unfortunately, the state of the art in automatic audio event detection is not very accurate [112].
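The amplitude-versus-time display described above can be reduced to a few lines of code. The sketch below is an illustration only, not part of the proposed system; the frame length and threshold are assumed values. It computes short-time RMS energy and flags frames quiet enough for a transcriber to skip.

```python
import numpy as np

def silence_mask(x, rate, frame_s=0.05, thresh_db=-40.0):
    """Flag frames whose RMS energy falls below thresh_db relative to the loudest frame.

    x         : 1-D array of audio samples
    rate      : sampling rate in Hz
    frame_s   : analysis frame length in seconds (assumed value)
    thresh_db : silence threshold in dB below the peak frame (assumed value)
    """
    n = max(1, int(frame_s * rate))
    # Trim to a whole number of frames, then compute per-frame RMS.
    frames = x[: len(x) // n * n].reshape(-1, n)
    rms = np.sqrt((frames ** 2).mean(axis=1))
    db = 20.0 * np.log10(rms / (rms.max() + 1e-12) + 1e-12)
    return db < thresh_db  # True for frames a transcriber could skip

# Toy signal at 8 kHz: one second of noise followed by one second of near-silence.
rate = 8000
rng = np.random.default_rng(0)
x = np.concatenate([rng.standard_normal(rate), 1e-4 * rng.standard_normal(rate)])
mask = silence_mask(x, rate)
```

A real display would plot the per-frame energy curve and shade the masked regions, letting the analyst jump between non-silent segments.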
We propose to represent audio anomalies to the analyst using a type of overcomplete lossless encoding: automatic anomaly salience scores will be displayed together with raw audio features, allowing a human analyst to drill down at any point in the data to resolve discrepancies in the visible display. The goal of this proposal is to present to analysts a coherent visual summary of both probabilistic and raw spectral information. The measurable outcome of this research will be the speed with which analysts find audio anomalies planted, by the experimenter, in a very large dataset. A successful research outcome will be a visual summary that lets the analyst detect most anomalies immediately (about 10,000× faster than real time), and all anomalies after brief interactive exploration (about 1,000× faster than real time). In short, the goal of this proposal is to transform, model, and reduce data for efficient, effective visualization and analytic reasoning:

• A simple time series is transformed into audio features and probabilistic model-based features, vastly reducing the quantity of data presented at one time to the analyst (fewer observations).
• Multiple techniques of dimensionality reduction condense the breadth of the data presented to the analyst (fewer variables).
• Computationally inexpensive multiscale caching of all layers, from summary variables down to the raw source data, supports interactive investigation of hypotheses at different spatial and temporal scales.
• The interactive visualizations are efficient: caching increases the analyst’s decision rate.
• The visualizations are effective: the decisions have a measurably low error rate.
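As an illustration of how an anomaly salience score might be derived from a probabilistic model of normal activity, the sketch below fits a single diagonal Gaussian to log-spectral frames of “normal” audio and scores new frames by negative log-likelihood, so higher scores mark candidate anomalies. The feature set, the model, and all parameter values here are assumptions for illustration, not the system proposed above.

```python
import numpy as np

def log_spectral_frames(x, n_fft=256):
    """Log-magnitude spectra of non-overlapping frames (illustrative features)."""
    frames = x[: len(x) // n_fft * n_fft].reshape(-1, n_fft)
    spec = np.abs(np.fft.rfft(frames, axis=1))
    return np.log(spec + 1e-9)

def fit_normal_model(feats):
    """Diagonal Gaussian over feature dimensions, fit to 'normal' training data."""
    return feats.mean(axis=0), feats.var(axis=0) + 1e-6

def salience(feats, model):
    """Per-frame negative log-likelihood under the normal model: higher = more anomalous."""
    mu, var = model
    return 0.5 * (((feats - mu) ** 2 / var) + np.log(2 * np.pi * var)).sum(axis=1)

# Toy demo: quiet background noise as 'normal', plus one loud planted burst.
rng = np.random.default_rng(1)
normal = 0.1 * rng.standard_normal(8192)
test = 0.1 * rng.standard_normal(8192)
test[4096:4352] += 5.0 * rng.standard_normal(256)   # the planted anomaly

model = fit_normal_model(log_spectral_frames(normal))
scores = salience(log_spectral_frames(test), model)
```

In the proposed displays, a curve like `scores` would be drawn alongside the raw spectral features, so the analyst can check whether a high-salience region is a genuine anomaly or a model artifact.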

[1] O. Rioul et al., Wavelets and signal processing, 1991, IEEE Signal Processing Magazine.

[2] DeLiang Wang et al., Schema-based modeling of phonemic restoration, 2003, INTERSPEECH.

[3] J. Smith et al., Establishing a gold standard for manual cough counting: video versus digital audio recordings, 2006, Cough.

[4] Mark Hasegawa-Johnson et al., Landmark-based speech recognition: report of the 2004 Johns Hopkins summer workshop, 2005, Proc. ICASSP.

[5] Biing-Hwang Juang et al., Speech Analysis in a Model of the Central Auditory System, 2007, IEEE Transactions on Audio, Speech, and Language Processing.

[6] Hank Kaczmarski et al., Tele-sports and tele-dance: full-body network interaction, 2003, VRST '03.

[7] Thomas S. Huang et al., HMM-Based and SVM-Based Recognition of the Speech of Talkers With Spastic Dysarthria, 2006, Proc. ICASSP.

[8] Ming Liu et al., HMM-Based Acoustic Event Detection with AdaBoost Feature Selection, 2007, CLEAR.

[9] Douglas A. Reynolds et al., Approaches and applications of audio diarization, 2005, Proc. ICASSP.

[10] Daniel P. W. Ellis et al., Midlevel representations for computational auditory scene analysis: the Weft element, 1998.

[11] Camille Goudeseune et al., Composing Outdoor Augmented-Reality Sound Environments, 2001, ICMC.

[12] Thomas S. Huang et al., Novel Gaussianized vector representation for improved natural scene categorization, 2010, Pattern Recognition Letters.

[13] Ronald R. Coifman et al., Entropy-based algorithms for best basis selection, 1992, IEEE Transactions on Information Theory.

[14] Jui Ting Huang et al., Multimodal speech and audio user interfaces for K-12 outreach, 2011.

[15] Daniel P. W. Ellis et al., Decoding speech in the presence of other sources, 2005, Speech Communication.

[16] Marcelo Knörich Zuffo et al., Commodity Clusters for Immersive Projection Environments, 2002, SIGGRAPH 2002.

[17] Gregory W. Wornell et al., Wavelet-based representations for a class of self-similar signals with application to fractal modulation, 1992, IEEE Transactions on Information Theory.

[18] C.-C. Jay Kuo et al., Audio content analysis for online audiovisual data segmentation and classification, 2001, IEEE Transactions on Speech and Audio Processing.

[19] Daniel P. W. Ellis et al., Model-Based Scene Analysis, 2005.

[20] Kunio Kashino et al., Application of the Bayesian probability network to music scene analysis, 1998.

[21] Lie Lu et al., Highlight sound effects detection in audio stream, 2003, Proc. ICME.

[22] Kuansan Wang et al., Spectral shape analysis in the central auditory system, 1995, IEEE Transactions on Speech and Audio Processing.

[23] Julien Pinquier et al., Robust speech/music classification in audio documents, 2002, INTERSPEECH.

[24] Guy J. Brown et al., Computational auditory scene analysis, 1994, Computer Speech and Language.

[25] Milind R. Naphade et al., Duration dependent input output Markov models for audio-visual event detection, 2001, Proc. ICME.

[26] H. Sorenson et al., Recursive Bayesian estimation using Gaussian sums, 1971.

[27] Salah Bourennane et al., Whitening spatial correlation filtering for hyperspectral anomaly detection, 2005, Proc. ICASSP.

[28] Terrence J. Sejnowski et al., An Information-Maximization Approach to Blind Separation and Blind Deconvolution, 1995, Neural Computation.

[29] Shigeo Abe, Pattern Classification, 2001, Springer London.

[30] Matthijs C. Dorst, Distinctive Image Features from Scale-Invariant Keypoints, 2011.

[31] Guy J. Brown et al., Computational Auditory Scene Analysis: Principles, Algorithms, and Applications, 2006.

[32] Yoichi Muraoka et al., Musical understanding at the beat level: real-time beat tracking for audio signals, 1998.

[33] Volker Hohmann et al., Computational auditory scene analysis by using statistics of high-dimensional speech dynamics and sound source direction, 2003, INTERSPEECH.

[34] Stan Davis et al., Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Se, 1980.

[35] Richards J. Heuer et al., Psychology of Intelligence Analysis, 1999.

[36] Camille Goudeseune et al., Composing With Parameters for Synthetic Instruments, 2001.

[37] Daniel P. W. Ellis et al., The auditory organization of speech and other sources in listeners and computational models, 2001, Speech Communication.

[38] Nima Mesgarani et al., Speech discrimination based on multiscale spectro-temporal modulations, 2004, Proc. ICASSP.

[39] Mark Hasegawa-Johnson et al., Maximum mutual information based acoustic-features representation of phonological features for speech recognition, 2002, Proc. ICASSP.

[40] F. Goudail et al., Some practical issues in anomaly detection and exploitation of regions of interest in hyperspectral images, 2006, Applied Optics.

[41] Yoav Freund et al., A Short Introduction to Boosting, 1999.

[42] Ichiro Fujinaga et al., Extending Audacity for Audio Annotation, 2006, ISMIR.

[43] Daniel P. W. Ellis, Extracting information from music audio, 2006, CACM.

[44] W. A. Munson et al., Loudness, Its Definition, Measurement and Calculation, 2004.

[45] Richard F. Lyon et al., Auditory model inversion for sound separation, 1994, Proc. ICASSP.

[46] George Kingsley Zipf, Human behavior and the principle of least effort, 1949.

[47] Mark Hasegawa-Johnson et al., Distinctive feature based SVM discriminant features for improvements to phone recognition on telephone band speech, 2005, INTERSPEECH.

[48] Andrey Temko et al., Acoustic event detection and classification in smart-room environments: evaluation of CHIL project systems, 2006.

[49] Ming Liu et al., Robust Analysis and Weighting on MFCC Components for Speech Recognition and Speaker Identification, 2007, Proc. ICME.

[50] A. V. Oppenheim et al., Enhancement and bandwidth compression of noisy speech, 1979, Proceedings of the IEEE.

[51] Mark Hasegawa-Johnson, A Multi-Stream Approach to Audiovisual Automatic Speech Recognition, 2007, IEEE Workshop on Multimedia Signal Processing.

[52] E. Owens et al., An Introduction to the Psychology of Hearing, 1997.

[53] Daniel P. W. Ellis et al., Prediction-driven computational auditory scene analysis for dense sound mixtures, 1996.

[54] Guy J. Brown et al., Techniques for handling convolutional distortion with 'missing data' automatic speech recognition, 2004, Speech Communication.

[55] Michael S. Lewicki, Efficient coding of natural sounds, 2002, Nature Neuroscience.

[56] Hank Kaczmarski et al., Application framework for CANVAS, the virtual reality environment for museums, 2005.

[57] John B. Moore et al., On-line estimation of hidden Markov model parameters based on the Kullback-Leibler information measure, 1993, IEEE Transactions on Signal Processing.

[58] John C. Hart et al., The CAVE: audio visual experience automatic virtual environment, 1992, CACM.

[59] S. Shamma et al., An account of monaural phase sensitivity, 2002, The Journal of the Acoustical Society of America.

[60] Daniel P. W. Ellis et al., Classification-based melody transcription, 2006, Machine Learning.

[61] Douglas E. Sturim et al., Speaker Verification using Text-Constrained Gaussian, 2002.

[62] D. A. Bertke et al., Finding events automatically in continuously sampled data streams via anomaly detection, 2000, Proc. IEEE NAECON.

[63] Thomas S. Huang et al., Intersession variability compensation for language detection, 2008, Proc. ICASSP.

[64] Chloé Clavel et al., Events Detection for an Audio-Based Surveillance System, 2005, Proc. ICME.

[65] Michaël Titus Maria Scheffers, Sifting vowels: auditory pitch analysis and sound segregation, 1983.

[66] H. Hermansky, Perceptual linear predictive (PLP) analysis of speech, 1990, The Journal of the Acoustical Society of America.

[67] R. Meddis, Modeling the identification of concurrent vowels with different fundamental frequencies, 1992, The Journal of the Acoustical Society of America.

[68] Lie Lu et al., Content analysis for audio classification and segmentation, 2002, IEEE Transactions on Speech and Audio Processing.

[69] Jung-Min Park et al., An overview of anomaly detection techniques: existing solutions and latest technological trends, 2007, Computer Networks.

[70] Gregory W. Wornell et al., A Karhunen-Loève-like expansion for 1/f processes via wavelets, 1990, IEEE Transactions on Information Theory.

[71] Steven G. Johnson et al., The Fastest Fourier Transform in the West, 1997.

[72] Sam T. Roweis, One Microphone Source Separation, 2000, NIPS.

[73] John J. Godfrey et al., SWITCHBOARD: telephone speech corpus for research and development, 1992, Proc. ICASSP.

[74] Paul Boersma, Praat: doing phonetics by computer, 2003.

[75] Frank Ehlers et al., Blind separation of convolutive mixtures and an application in automatic speech recognition in a noisy environment, 1997, IEEE Transactions on Signal Processing.

[76] Jiří Navrátil, Automatic Language Identification, 2006.

[77] John R. Hershey et al., Single microphone source separation using high resolution signal reconstruction, 2004, Proc. ICASSP.

[78] Amro El-Jaroudi et al., New signal decomposition method based speech enhancement, 2007, Signal Processing.

[79] Guy J. Brown et al., A comparison of auditory and blind separation techniques for speech segregation, 2001, IEEE Transactions on Speech and Audio Processing.

[80] Camille Goudeseune et al., Interpolated mappings for musical instruments, 2002, Organised Sound.

[81] J. Stephen Downie, Music information retrieval, 2005, Annual Review of Information Science and Technology.

[82] Mark Hasegawa-Johnson et al., Optimal Multi-Microphone Speech Enhancement in Cars, 2012.

[83] Ian Foster et al., Globus GridFTP: What's New in 2007, 2007.

[84] Tieniu Tan et al., Similarity based vehicle trajectory clustering and anomaly detection, 2005, Proc. ICIP.

[85] Mark Hasegawa-Johnson et al., Maximum conditional mutual information projection for speech recognition, 2003, INTERSPEECH.

[86] Svante Granqvist et al., The correlogram: a visual display of periodicity, 2003, The Journal of the Acoustical Society of America.

[87] Gunnar Rätsch et al., Soft Margins for AdaBoost, 2001, Machine Learning.

[88] Joemon M. Jose et al., Audio-Based Event Detection for Sports Video, 2003, CIVR.

[89] Biing-Hwang Juang et al., Maximum likelihood estimation for multivariate mixture observations of Markov chains, 1986, IEEE Transactions on Information Theory.

[90] Israel Cohen et al., Anomaly detection based on an iterative local statistics approach, 2004, IEEE Convention of Electrical and Electronics Engineers in Israel.

[91] Q. Summerfield et al., Modeling the perception of concurrent vowels: vowels with different fundamental frequencies, 1990, The Journal of the Acoustical Society of America.

[92] Mark Hasegawa-Johnson et al., Normalized recognition of speech and audio events, 2011.

[93] Ming Liu et al., AVICAR: audio-visual speech corpus in a car environment, 2004, INTERSPEECH.

[94] Daniel Patrick Whittlesey Ellis, Prediction-driven computational auditory scene analysis, 1996.

[95] Yun Fu et al., Lipreading by Locality Discriminant Graph, 2007, Proc. ICIP.

[96] S. L. Hanauer, Speech Analysis and Synthesis by Linear Prediction of the Speech Wave, 2000.

[97] David Small et al., Case study: A virtual environment for genomic data visualization, 2002, IEEE Visualization.

[98] Mark Hasegawa-Johnson et al., Adaptation of tandem hidden Markov models for non-speech audio event detection, 2009.

[99] Mark Hasegawa-Johnson et al., Detection of Acoustic-Phonetic Landmarks in Mismatched Conditions using a Biomimetic Model of Human Auditory Processing, 2012, COLING.

[100] Kate Saenko et al., Audiovisual speech recognition with articulator positions as hidden variables, 2007.

[101] Joseph F. Murray et al., Dictionary Learning Algorithms for Sparse Representation, 2003, Neural Computation.

[102] Thomas S. Huang et al., Feature analysis and selection for acoustic event detection, 2008, Proc. ICASSP.

[103] Bruno A. Olshausen, Sparse coding of sensory inputs, 2004, Current Opinion in Neurobiology.

[104] Andrey Temko et al., Classification of meeting-room acoustic events with support vector machines and variable-feature-set clustering, 2005, Proc. ICASSP.

[105] Guy J. Brown et al., Separation of speech from interfering sounds based on oscillatory correlation, 1999, IEEE Transactions on Neural Networks.

[106] Tomasz Letowski et al., Detection and Localization of Magazine Insertion Clicks in Various Environmental Noises, 2007.

[107] Daniel P. W. Ellis et al., Eigenrhythms: Drum pattern basis sets for classification and generation, 2004, ISMIR.

[108] Alexander J. Smola et al., Learning with kernels, 1998.

[109] Paris Smaragdis et al., Convolutive Speech Bases and Their Application to Supervised Speech Separation, 2007, IEEE Transactions on Audio, Speech, and Language Processing.

[110] Daniel P. W. Ellis et al., Using knowledge to organize sound: The prediction-driven approach to computational auditory scene analysis and its application to speech/nonspeech mixtures, 1999, Speech Communication.

[111] Tomohiro Nakatani et al., A new speech enhancement: speech stream segregation, 1996, Proc. ICSLP.

[112] Kristin A. Cook et al., Illuminating the Path: The Research and Development Agenda for Visual Analytics, 2005.

[113] Simon King et al., Articulatory Feature-Based Methods for Acoustic and Audio-Visual Speech Recognition: Summary from the 2006 JHU Summer Workshop, 2007, Proc. ICASSP.

[114] Douglas A. Reynolds et al., Robust text-independent speaker identification using Gaussian mixture speaker models, 1995, IEEE Transactions on Speech and Audio Processing.

[115] Camille Goudeseune et al., Myriad: scalable VR via peer-to-peer connectivity, PC clustering, and transient inconsistency, 2005, VRST '05.

[116] Camille Goudeseune et al., Syzygy: native PC cluster VR, 2003, IEEE Virtual Reality.

[117] Thomas S. Huang et al., Improving faster-than-real-time human acoustic event detection by saliency-maximized audio visualization, 2012, Proc. ICASSP.