Duration-dependent input-output Markov models for audio-visual event detection

Detecting semantic events with spatio-temporal support from audio-visual data is a challenging multimedia understanding problem. The difficulty lies in the gap between low-level media features and high-level semantic concepts. We present a duration-dependent input-output Markov model (DDIOMM) to detect events based on multiple modalities. The DDIOMM combines the ability to model non-exponential duration densities with the mapping of input sequences to output sequences; in spirit it resembles input-output HMMs [8] as well as inhomogeneous HMMs [3]. We use the DDIOMM to model the audio-visual event "explosion" and compare its detection performance with that of the IOMM as well as the HMM. Experiments reveal that modeling duration improves detection performance.
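The core modeling idea can be made concrete: a standard HMM's fixed self-transition probability implies a geometric (discretized exponential) state-duration density, whereas a duration-dependent model commits to an explicit duration density and derives duration-varying "stay" probabilities from it. The sketch below illustrates this relationship under stated assumptions; it is not the paper's implementation, and the example density is made up.

```python
import numpy as np

def stay_probs_from_duration_density(p):
    """Given a duration density p[d-1] = P(D = d) for d = 1..D_max, return
    a[d-1] = P(stay in state | current sojourn length is d), i.e.
    a(d) = S(d + 1) / S(d), with survival S(d) = sum_{k >= d} P(k)."""
    p = np.asarray(p, dtype=float)
    p = p / p.sum()                       # normalize over the modeled support
    survival = np.cumsum(p[::-1])[::-1]   # survival[i] = S(i + 1)
    a = np.zeros_like(p)
    a[:-1] = survival[1:] / survival[:-1]
    return a                              # a[-1] = 0: must leave at D_max

def duration_density_from_stay_probs(a):
    """Inverse map: P(d) = S(d) * (1 - a(d)), with S(d) = prod_{k < d} a(k)."""
    a = np.asarray(a, dtype=float)
    survival = np.concatenate(([1.0], np.cumprod(a[:-1])))
    return survival * (1.0 - a)

# A peaked, clearly non-exponential duration density (hypothetical numbers),
# e.g. for events that typically last around 10 frames.
d = np.arange(1, 31)
density = np.exp(-0.5 * ((d - 10.0) / 3.0) ** 2)
stay = stay_probs_from_duration_density(density)
# Unlike an HMM's constant self-transition probability, `stay` now varies
# with how long the model has already remained in the state.
```

The two functions are exact inverses, so any duration density on a bounded support can be realized by making the self-transition probability a function of the elapsed sojourn time, which is the extra degree of freedom the DDIOMM exploits over a plain IOMM or HMM.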

[1] Milind R. Naphade, et al. Multimodal pattern matching for audio-visual query and retrieval, 2001, IS&T/SPIE Electronic Imaging.

[2] Brendan J. Frey, et al. Probabilistic multimedia objects (multijects): a novel approach to video indexing and retrieval in multimedia systems, 1998, Proceedings 1998 International Conference on Image Processing (ICIP 98).

[3] Jay G. Wilpon, et al. Modeling state durations in hidden Markov models for automatic speech recognition, 1992, Proceedings of ICASSP-92, IEEE International Conference on Acoustics, Speech, and Signal Processing.

[4] Vladimir Pavlovic, et al. A Bayesian framework for combining gene predictions, 2002, Bioinformatics.

[5] Alex Pentland, et al. Coupled hidden Markov models for complex action recognition, 1997, Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[6] Lawrence R. Rabiner, et al. A tutorial on hidden Markov models and selected applications in speech recognition, 1989, Proceedings of the IEEE.

[7] Vladimir Pavlovic, et al. Audio-visual speaker detection using dynamic Bayesian networks, 2000, Proceedings Fourth IEEE International Conference on Automatic Face and Gesture Recognition.

[8] Yoshua Bengio, et al. Input-output HMMs for sequence processing, 1996, IEEE Transactions on Neural Networks.

[9] Tsuhan Chen, et al. Audio-visual integration in multimodal communication, 1998, Proceedings of the IEEE.

[10] Ashutosh Garg, et al. Sampling Based EM Algorithm, 2000.

[11] Dan Roth, et al. Understanding Probabilistic Classifiers, 2001, ECML.

[12] Nicu Sebe, et al. Facial expression recognition from video sequences, 2002, Proceedings IEEE International Conference on Multimedia and Expo.

[13] Milind R. Naphade, et al. Semantic video indexing using a probabilistic framework, 2000, Proceedings 15th International Conference on Pattern Recognition (ICPR-2000).

[14] Milind R. Naphade, et al. Stochastic modeling of soundtrack for efficient segmentation and indexing of video, 1999, Electronic Imaging.

[15] Takeo Kanade, et al. Semantic analysis for video contents extraction—spotting by association in news video, 1997, MULTIMEDIA '97.

[16] MIHMM: Mutual Information Hidden Markov Models, 2002.

[17] Thomas S. Huang, et al. Fusion of global and local information for object detection, 2002, Object recognition supported by user interaction for service robots.

[18] Dan Roth, et al. On generalization bounds, projection profile, and margin distribution, 2002, ICML.

[19] Vladimir Pavlovic, et al. Bayesian networks as ensemble of classifiers, 2002, Object recognition supported by user interaction for service robots.

[20] Dan Roth, et al. Learning Coherent Concepts, 2001, ALT.

[21] Nicu Sebe, et al. Emotion recognition using a Cauchy Naive Bayes classifier, 2002, Object recognition supported by user interaction for service robots.