Layered dynamic mixture model for pattern discovery in asynchronous multi-modal streams [video applications]

We propose a layered dynamic mixture model that performs asynchronous multi-modal fusion for unsupervised pattern discovery in video. The lower layer of the model uses generative temporal structures, such as a hierarchical hidden Markov model, to convert the audio-visual streams into mid-level labels; it also models the correlations in text with probabilistic latent semantic analysis. The upper layer fuses the statistical evidence across diverse modalities with a flexible meta-mixture model that assumes only loose temporal correspondence. Evaluation on a large news database shows that multi-modal clusters correspond better to news topics than audio-visual clusters alone; novel analysis techniques suggest that meaningful clusters occur when the salient features predicted by the model concur with those shown in the story clusters.
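The upper-layer fusion step can be illustrated with a much simpler stand-in. The sketch below is not the model proposed here: it assumes the lower layer has already produced discrete mid-level labels for each modality, that the labels falling inside each news story have been pooled into one count histogram per modality (a crude way to tolerate loose temporal correspondence), and it then fits a plain mixture of multinomials with EM. The function name `fit_fusion_mixture`, its parameters, and the smoothing constant `alpha` are all hypothetical.

```python
import numpy as np

def fit_fusion_mixture(counts, n_clusters, n_iter=50, seed=0, alpha=1e-2):
    """Fit a mixture of multinomials over per-modality label histograms.

    counts: list of arrays, one per modality, each of shape
            (n_stories, n_labels_in_that_modality).
    Returns (resp, prior, emis), where resp[i, k] is the posterior
    probability that story i belongs to cluster k.
    """
    rng = np.random.default_rng(seed)
    n_stories = counts[0].shape[0]
    # Start from random soft cluster assignments.
    resp = rng.dirichlet(np.ones(n_clusters), size=n_stories)
    for _ in range(n_iter):
        # M-step: cluster priors and one label distribution per modality,
        # smoothed by alpha (an assumed constant) to avoid log(0).
        prior = resp.sum(axis=0) + alpha
        prior = prior / prior.sum()
        emis = []
        for c in counts:
            e = resp.T @ c + alpha                  # (n_clusters, n_labels)
            emis.append(e / e.sum(axis=1, keepdims=True))
        # E-step: modalities are treated as conditionally independent
        # given the cluster, so their log-likelihoods add.
        log_post = np.log(prior)[None, :]
        for c, e in zip(counts, emis):
            log_post = log_post + c @ np.log(e).T
        log_post = log_post - log_post.max(axis=1, keepdims=True)
        resp = np.exp(log_post)
        resp = resp / resp.sum(axis=1, keepdims=True)
    return resp, prior, emis

# Toy input: 6 stories, 2 audio-visual labels, 3 text topics (made-up counts).
av_counts = np.array([[5, 0], [5, 0], [5, 0], [0, 5], [0, 5], [0, 5]], dtype=float)
text_counts = np.array([[4, 0, 0]] * 3 + [[0, 0, 4]] * 3, dtype=float)
resp, prior, emis = fit_fusion_mixture([av_counts, text_counts], n_clusters=2)
story_cluster = resp.argmax(axis=1)
```

Because evidence is pooled per story before fusion, the streams do not need to be aligned frame by frame; the cost is that all temporal structure inside a story is discarded, which the layered dynamic model described above is designed to retain.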
