Non-speech Acoustic Event Detection Using Multimodal Information

Non-speech acoustic event detection (AED) aims to recognize events that are relevant to human activities and associated with audio information. Much previous research has focused on restricted highlight events and has relied heavily on ad-hoc detectors for these events. This thesis uses multimodal data to make non-speech acoustic event detection and classification more robust without requiring expensive annotation. Specifically, it emphasizes designing suitable feature representations for different modalities and fusing the information properly. Two cases are studied:

(1) Acoustic event detection in a meeting room scenario using single-microphone audio cues and single-camera visual cues. Cues for non-speech events often exist in both audio and vision, but not necessarily in a synchronized fashion. We jointly model audio and visual cues to improve event detection, using multistream HMMs and coupled HMMs (CHMMs). Spatial pyramid histograms based on optical flow are proposed as a generalizable visual representation that does not require training on labeled video data. In a multimedia meeting room non-speech event detection task, the proposed methods outperform previously reported systems that leverage ad-hoc visual object detectors and sound localization information obtained from multiple microphones.

(2) Multimodal feature representation for person detection at border crossings. Based on the phenomenology of the differences between humans and four-legged animals, we propose an enhanced autocorrelation pattern for feature extraction from seismic sensors and an exemplar selection framework for acoustic sensors. We also propose using temporal patterns from ultrasonic sensors. We perform decision and feature fusion to combine the information from all three modalities. Experimental results show that the proposed methods improve the robustness of the system.
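To make the visual representation in case (1) more concrete, the following is a minimal sketch of a spatial pyramid histogram computed over a dense optical flow field. The abstract does not give the thesis's actual configuration, so everything specific here is an assumption made for illustration: the function name `flow_pyramid_histogram`, the three pyramid levels, the eight orientation bins, the magnitude weighting, and the final L1 normalization. The flow field itself is assumed to come from any dense optical flow estimator and is passed in as a grid of (dx, dy) vectors.

```python
import math


def flow_pyramid_histogram(flow, levels=3, bins=8):
    """Spatial pyramid of flow-orientation histograms (illustrative sketch).

    flow   -- 2-D grid (list of rows) of (dx, dy) motion vectors
    levels -- number of pyramid levels; level l splits the frame into 2^l x 2^l cells
    bins   -- number of orientation bins per cell (assumed value, not from the thesis)
    """
    height, width = len(flow), len(flow[0])
    feature = []
    for level in range(levels):
        cells = 2 ** level  # cells per side at this level
        hists = [[[0.0] * bins for _ in range(cells)] for _ in range(cells)]
        for y, row in enumerate(flow):
            for x, (dx, dy) in enumerate(row):
                mag = math.hypot(dx, dy)
                if mag == 0.0:
                    continue  # static pixels carry no orientation
                angle = math.atan2(dy, dx) % (2 * math.pi)
                # Clamp guards against angle rounding up to exactly 2*pi.
                b = min(int(angle / (2 * math.pi) * bins), bins - 1)
                cy = y * cells // height
                cx = x * cells // width
                hists[cy][cx][b] += mag  # magnitude-weighted vote
        for cell_row in hists:
            for hist in cell_row:
                feature.extend(hist)
    total = sum(feature)
    # L1-normalize so the descriptor does not depend on frame size.
    return [v / total for v in feature] if total > 0 else feature


# Usage: a 4x4 frame in which every pixel moves one unit to the right.
frame_flow = [[(1.0, 0.0)] * 4 for _ in range(4)]
descriptor = flow_pyramid_histogram(frame_flow)
print(len(descriptor))  # (1 + 4 + 16) cells * 8 bins
```

Because the descriptor is built only from motion statistics, a representation of this kind needs no labeled video to construct, which is the property the abstract highlights. The per-frame vectors would then serve as the visual observation stream for a multistream or coupled HMM.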

[1]  Richard M. Schwartz,et al.  Enhancement of speech corrupted by acoustic noise , 1979, ICASSP.

[2]  Xiaodan Zhuang,et al.  Modeling audio and visual cues for real-world event detection , 2011 .

[3]  Pinar Duygulu Sahin,et al.  Human action recognition with line and flow histograms , 2008, 2008 19th International Conference on Pattern Recognition.

[4]  Mark Hasegawa-Johnson,et al.  Multi-sensory features for personnel detection at border crossings , 2011, 14th International Conference on Information Fusion.

[5]  C.-C. Jay Kuo,et al.  Audio content analysis for online audiovisual data segmentation and classification , 2001, IEEE Trans. Speech Audio Process..

[6]  L. Rothkrantz,et al.  Toward an affect-sensitive multimodal human-computer interaction , 2003, Proc. IEEE.

[7]  Sham M. Kakade,et al.  Multi-view clustering via canonical correlation analysis , 2009, ICML '09.

[8]  Lie Lu,et al.  Content analysis for audio classification and segmentation , 2002, IEEE Trans. Speech Audio Process..

[9]  Regunathan Radhakrishnan,et al.  Audio-Visual Event Recognition with Application in Sports Video , 2005 .

[10]  Mark Hasegawa-Johnson,et al.  Acoustic fall detection using Gaussian mixture models and GMM supervectors , 2009, 2009 IEEE International Conference on Acoustics, Speech and Signal Processing.

[11]  Thyagaraju Damarla Sensor fusion for ISR assets , 2010, Defense + Commercial Sensing.

[12]  Rainer Martin,et al.  Noise power spectral density estimation based on optimal smoothing and minimum statistics , 2001, IEEE Trans. Speech Audio Process..

[13]  Kate Saenko,et al.  AUDIOVISUAL SPEECH RECOGNITION WITH ARTICULATOR POSITIONS AS HIDDEN VARIABLES , 2007 .

[14]  Daniel P. McGaffigan,et al.  Spectrum analysis techniques for personnel detection using seismic sensors , 2003, SPIE Defense + Commercial Sensing.

[15]  Noel E. O'Connor,et al.  Event detection in field sports video using audio-visual features and a support vector Machine , 2005, IEEE Transactions on Circuits and Systems for Video Technology.

[16]  Theodore W. Berger,et al.  Cadence analysis of temporal gait patterns for seismic discrimination between human and quadruped footsteps , 2009, 2009 IEEE International Conference on Acoustics, Speech and Signal Processing.

[17]  M. Hasegawa-Johnson,et al.  Exemplar Selection Methods to Distinguish Human from Animal Footsteps , 2011 .

[18]  Bhiksha Raj,et al.  Acoustic Doppler sonar for gait recogination , 2007, 2007 IEEE Conference on Advanced Video and Signal Based Surveillance.

[19]  Mitch Weintraub,et al.  Using speech/non-speech detection to bias recognition search on noisy data , 2003, 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003. Proceedings. (ICASSP '03)..

[20]  Taras Butko,et al.  Improving detection of acoustic events using audiovisual data and feature level fusion , 2009, INTERSPEECH.

[21]  H Hermansky,et al.  Perceptual linear predictive (PLP) analysis of speech. , 1990, The Journal of the Acoustical Society of America.

[22]  David J. C. MacKay,et al.  Information-Based Objective Functions for Active Data Selection , 1992, Neural Computation.

[23]  Robert D. Nowak,et al.  Human Active Learning , 2008, NIPS.

[24]  Joemon M. Jose,et al.  Audio-Based Event Detection for Sports Video , 2003, CIVR.

[25]  Jean-Luc Schwartz,et al.  Comparing models for audiovisual fusion in a noisy-vowel recognition task , 1999, IEEE Trans. Speech Audio Process..

[26]  Dong Yu,et al.  Deep Learning and Its Applications to Signal and Information Processing , 2011 .

[27]  Thomas S. Huang,et al.  Real-world acoustic event detection , 2010, Pattern Recognit. Lett..

[28]  Taras Butko,et al.  Audiovisual event detection towards scene understanding , 2009, 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops.

[29]  Lance M. Kaplan,et al.  Human infrastructure & human activity detection , 2007, 2007 10th International Conference on Information Fusion.

[30]  Dong Yu,et al.  Deep Learning and Its Applications to Signal and Information Processing [Exploratory DSP] , 2011, IEEE Signal Processing Magazine.

[31]  Andrew McCallum,et al.  Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data , 2001, ICML.

[32]  Lawrence R. Rabiner,et al.  On the use of autocorrelation analysis for pitch detection , 1977 .

[33]  Chih-Jen Lin,et al.  LIBSVM: A library for support vector machines , 2011, TIST.

[34]  Honglak Lee,et al.  Unsupervised feature learning for audio classification using convolutional deep belief networks , 2009, NIPS.

[35]  James M. Sabatier,et al.  Range limitation for seismic footstep detection , 2008, SPIE Defense + Commercial Sensing.

[36]  Kuldip K. Paliwal,et al.  Identity verification using speech and face information , 2004, Digit. Signal Process..

[37]  Lawrence R. Rabiner,et al.  A tutorial on hidden Markov models and selected applications in speech recognition , 1989, Proc. IEEE.

[38]  Thomas S. Huang,et al.  Feature analysis and selection for acoustic event detection , 2008, 2008 IEEE International Conference on Acoustics, Speech and Signal Processing.

[39]  Mark Hasegawa-Johnson,et al.  Improving acoustic event detection using generalizable visual features and multi-modality modeling , 2011, 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[40]  Cordelia Schmid,et al.  Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[41]  Guy J. Brown,et al.  Computational auditory scene analysis , 1994, Comput. Speech Lang..

[42]  James M. Sabatier,et al.  Human detection range by active Doppler and passive ultrasonic methods , 2008, SPIE Defense + Commercial Sensing.

[43]  M. Malfatti,et al.  Netcarity: Ambient technology to support older people at home , 2009 .

[44]  Julien Pinquier,et al.  Robust speech / music classification in audio documents , 2002, INTERSPEECH.

[45]  A. Adjoudani,et al.  On the Integration of Auditory and Visual Parameters in an HMM-based ASR , 1996 .

[46]  Milind R. Naphade,et al.  Duration dependent input output markov models for audio-visual event detection , 2001, IEEE International Conference on Multimedia and Expo, 2001. ICME 2001..

[47]  L. R. Rabiner,et al.  Recognition of isolated digits using hidden Markov models with continuous mixture densities , 1985, AT&T Technical Journal.

[48]  Horst Bischof,et al.  A Duality Based Approach for Realtime TV-L1 Optical Flow , 2007, DAGM-Symposium.

[49]  Andrey Temko,et al.  ACOUSTIC EVENT DETECTION AND CLASSIFICATION IN SMART-ROOM ENVIRONMENTS: EVALUATION OF CHIL PROJECT SYSTEMS , 2006 .

[50]  Thomas S. Huang,et al.  Bimodal speech recognition using coupled hidden Markov models , 2000, INTERSPEECH.

[51]  Bernhard E. Boser,et al.  A training algorithm for optimal margin classifiers , 1992, COLT '92.

[52]  Satoshi Nakamura,et al.  Statistical multimodal integration for audio-visual speech processing , 2002, IEEE Trans. Neural Networks.

[53]  J. Smith,et al.  Establishing a gold standard for manual cough counting: video versus digital audio recordings , 2006, Cough.

[54]  Craig Stuart Sapp,et al.  Efficient Pitch Detection Techniques for Interactive Music , 2001, ICMC.

[55]  Paul Over,et al.  High-level feature detection from video in TRECVid: a 5-year retrospective of achievements , 2009 .

[56]  Matti Karjalainen,et al.  A computationally efficient multipitch analysis model , 2000, IEEE Trans. Speech Audio Process..