Feature selection for multimodal acoustic event detection

The detection of acoustic events (AEs) naturally produced in a meeting room can help describe human and social activity. The automatic description of interactions between humans and their environment is useful for providing implicit assistance to the people inside the room, context-aware and content-aware information requiring a minimum of human attention or interruptions, support for high-level analysis of the underlying acoustic scene, and so on. Moreover, the recent rapid growth of available audio and audiovisual content demands tools for analyzing, indexing, searching and retrieving those documents. Given an audio document, the first processing step is usually audio segmentation (AS), i.e. the partitioning of the input audio stream into acoustically homogeneous regions that are labelled according to a predefined broad set of classes such as speech, music or noise. Acoustic event detection (AED) is the objective of this thesis work. A variety of features, coming not only from the audio but also from the video modality, is proposed to address the detection problem in the meeting-room and broadcast-news domains.

Two basic detection approaches are investigated: joint segmentation and classification using hidden Markov models (HMMs) with Gaussian mixture densities (GMMs), and detection-by-classification using discriminative support vector machines (SVMs). For the first approach, a fast one-pass-training feature selection algorithm is developed in this thesis to select, for each AE class, the subset of multimodal features that yields the best detection rate.

AED in meeting-room environments aims at processing the signals collected by distant microphones and video cameras in order to obtain the temporal sequence of (possibly overlapping) AEs produced in the room. When applied to interactive seminars with a certain degree of spontaneity, detection from the audio modality alone shows a large number of errors, mostly due to the temporal overlaps of sounds. This thesis includes several novelties regarding the task of multimodal AED. First, video features are used: since the acoustic sources do not overlap in the video modality (except for occlusions), the proposed features improve AED in such rather spontaneous recordings. Second, acoustic localization features are included, which, in combination with the usual spectro-temporal audio features, yield a further improvement in recognition rate. Third, feature-level and decision-level fusion strategies for combining the audio and video modalities are compared. In the latter case, the system output scores are combined using two statistical approaches: the weighted arithmetic mean and the fuzzy integral.

Finally, due to the scarcity of annotated multimodal data, and in particular of data with temporal sound overlaps, a new multimodal database with a rich variety of meeting-room AEs has been recorded, manually annotated, and made publicly available for research purposes.
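To make the detection-by-classification idea more concrete, the Python sketch below trains a multi-class SVM on window-level feature statistics and turns the sequence of window decisions into events. It is only an illustration under assumed choices: the window length, hop, mean/std pooling and SVM hyperparameters are not taken from the thesis.

```python
import numpy as np
from sklearn.svm import SVC

def train_window_svm(windows, labels):
    """Train a multi-class SVM on fixed-length feature windows.

    `windows` is an (n_windows, n_features) array of window-level statistics
    (e.g. mean/std of the frame features); `labels` holds one AE class per window.
    """
    clf = SVC(kernel="rbf", C=10.0, gamma="scale")
    clf.fit(windows, labels)
    return clf

def detect_by_classification(clf, frames, win=100, hop=50, fps=100.0):
    """Slide a window over the frame-level feature sequence, classify each
    window, and merge consecutive windows with the same label into events."""
    events = []
    prev_label, start = None, 0.0
    for i in range(0, len(frames) - win + 1, hop):
        window = frames[i:i + win]
        # Pool frame features into one window descriptor (same pooling as training).
        stats = np.concatenate([window.mean(axis=0), window.std(axis=0)])
        label = clf.predict(stats[None, :])[0]
        t = i / fps
        if label != prev_label:
            if prev_label is not None:
                events.append((start, t, prev_label))
            prev_label, start = label, t
    if prev_label is not None:
        events.append((start, (len(frames) - 1) / fps, prev_label))
    return events
```

A background or "other" class would typically be dropped from the resulting event list; that filtering is omitted here for brevity.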
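The abstract describes the feature selection algorithm only as fast and one-pass-training, so its details are specific to the thesis. As a generic stand-in, the sketch below uses a greedy forward wrapper merely to illustrate the goal of picking one multimodal feature subset per AE class; the `detection_rate` callback and the feature names mentioned afterwards are hypothetical.

```python
from typing import Callable, Dict, List, Sequence

def select_features_per_class(
    classes: Sequence[str],
    features: Sequence[str],
    detection_rate: Callable[[str, List[str]], float],
) -> Dict[str, List[str]]:
    """Greedy forward selection of a feature subset for each AE class.

    `detection_rate(ae_class, subset)` is assumed to train and score a
    detector for `ae_class` using only `subset` and return its detection rate.
    """
    selected: Dict[str, List[str]] = {}
    for ae_class in classes:
        subset: List[str] = []
        best_rate = 0.0
        remaining = list(features)
        while remaining:
            # Try adding each remaining feature and keep the best candidate.
            gains = [(detection_rate(ae_class, subset + [f]), f) for f in remaining]
            rate, feat = max(gains)
            if rate <= best_rate:        # no further improvement
                break
            best_rate, subset = rate, subset + [feat]
            remaining.remove(feat)
        selected[ae_class] = subset
    return selected
```

A call such as `select_features_per_class(["door_slam", "applause"], ["mfcc", "acoustic_localization", "video_motion"], my_detection_rate)` would return one feature subset per class; unlike the thesis' one-pass algorithm, this wrapper retrains a detector for every candidate subset.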
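For the decision-level fusion mentioned above, the two combiners named in the abstract can be written down exactly; the sketch below computes, per AE class, a weighted arithmetic mean and a two-source Sugeno fuzzy integral of the audio and video scores. The weights, fuzzy densities and example scores are illustrative values, not those used in the thesis.

```python
import numpy as np

def weighted_arithmetic_mean(audio_scores, video_scores, w_audio=0.6, w_video=0.4):
    """Per-class weighted arithmetic mean of two modality score vectors."""
    return w_audio * np.asarray(audio_scores) + w_video * np.asarray(video_scores)

def sugeno_fuzzy_integral(audio_scores, video_scores, g_audio=0.7, g_video=0.5):
    """Two-source Sugeno fuzzy integral, computed independently per class.

    For each class the sources are sorted by score and the integral is
    max_i min(h_(i), g(A_i)), where A_i contains the i best sources and
    g({audio, video}) = 1 by construction.
    """
    h = np.stack([np.asarray(audio_scores), np.asarray(video_scores)])  # (2, n_classes)
    g = np.array([g_audio, g_video])
    fused = np.empty(h.shape[1])
    for c in range(h.shape[1]):
        order = np.argsort(h[:, c])[::-1]            # best source first
        h_sorted = h[order, c]
        g_cum = np.array([g[order[0]], 1.0])         # g(A_1), g(A_2) = 1
        fused[c] = np.max(np.minimum(h_sorted, g_cum))
    return fused

if __name__ == "__main__":
    # Hypothetical normalized scores for four AE classes from the two systems.
    audio = np.array([0.80, 0.10, 0.30, 0.05])
    video = np.array([0.60, 0.40, 0.20, 0.10])
    print("WAM   :", weighted_arithmetic_mean(audio, video))
    print("Sugeno:", sugeno_fuzzy_integral(audio, video))
    print("Decision (WAM argmax):", int(np.argmax(weighted_arithmetic_mean(audio, video))))
```

In practice the fuzzy densities would reflect the estimated reliability of each single-modality detector; the final decision is then taken as the argmax over classes, as in the `__main__` example.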
