Improving acoustic event detection using generalizable visual features and multi-modality modeling

Acoustic event detection (AED) aims to identify both timestamps and types of multiple events and has been found to be very challenging. The cues for these events often times exist in both audio and vision, but not necessarily in a synchronized fashion. We study improving the detection and classification of the events using cues from both modalities. We propose optical flow based spatial pyramid histograms as a generalizable visual representation that does not require training on labeled video data. Hidden Markov models (HMMs) are used for audio-only modeling, and multi-stream HMMs or coupled HMMs (CHMM) are used for audio-visual joint modeling. To allow the flexibility of audio-visual state asynchrony, we explore effective CHMM training via HMM state-space mapping, parameter tying and different initialization schemes. The proposed methods successfully improve acoustic event classification and detection on a multimedia meeting room dataset containing eleven types of general non-speech events without using extra data resource other than the video stream accompanying the audio observations. Our systems perform favorably compared to previously reported systems leveraging ad-hoc visual cue detectors and localization information obtained from multiple microphones.

[1]  A. Adjoudani,et al.  On the Integration of Auditory and Visual Parameters in an HMM-based ASR , 1996 .

[2]  Milind R. Naphade,et al.  Duration dependent input output markov models for audio-visual event detection , 2001, IEEE International Conference on Multimedia and Expo, 2001. ICME 2001..

[3]  Chloé Clavel,et al.  Events Detection for an Audio-Based Surveillance System , 2005, 2005 IEEE International Conference on Multimedia and Expo.

[4]  Regunathan Radhakrishnan,et al.  Audio-Visual Event Recognition with Application in Sports Video , 2005 .

[5]  Jean-Luc Schwartz,et al.  Comparing models for audiovisual fusion in a noisy-vowel recognition task , 1999, IEEE Trans. Speech Audio Process..

[6]  Thomas S. Huang,et al.  Feature analysis and selection for acoustic event detection , 2008, 2008 IEEE International Conference on Acoustics, Speech and Signal Processing.

[7]  Kuldip K. Paliwal,et al.  Identity verification using speech and face information , 2004, Digit. Signal Process..

[8]  Thomas S. Huang,et al.  Audio-visual speech modeling using coupled hidden Markov models , 2002, 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[9]  Satoshi Nakamura,et al.  Statistical multimodal integration for audio-visual speech processing , 2002, IEEE Trans. Neural Networks.

[10]  Horst Bischof,et al.  A Duality Based Approach for Realtime TV-L1 Optical Flow , 2007, DAGM-Symposium.

[11]  David G. Stork,et al.  Speechreading by Humans and Machines , 1996 .

[12]  Thomas S. Huang,et al.  Real-world acoustic event detection , 2010, Pattern Recognit. Lett..

[13]  Taras Butko,et al.  Audiovisual event detection towards scene understanding , 2009, 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops.

[14]  Mitch Weintraub,et al.  Using speech/non-speech detection to bias recognition search on noisy data , 2003, 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003. Proceedings. (ICASSP '03)..

[15]  Cordelia Schmid,et al.  Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[16]  Sham M. Kakade,et al.  Multi-view clustering via canonical correlation analysis , 2009, ICML '09.

[17]  Pinar Duygulu Sahin,et al.  Human action recognition with line and flow histograms , 2008, 2008 19th International Conference on Pattern Recognition.