Paper information - Joint audio-visual bi-modal codewords for video event detection

Joint audio-visual patterns often exist in videos and provide strong multi-modal cues for detecting multimedia events. However, conventional methods generally fuse the visual and audio information only at a superficial level, without adequately exploring the deeper intrinsic joint patterns. In this paper, we propose a joint audio-visual bi-modal representation, called bi-modal words. We first build a bipartite graph to model the relations between the quantized words extracted from the visual and audio modalities. Partitioning of the bipartite graph is then applied to construct the bi-modal words, which reveal the joint patterns across modalities. Finally, different pooling strategies are employed to re-quantize the visual and audio words into the bi-modal words and form bi-modal Bag-of-Words representations that are fed to subsequent multimedia event classifiers. We experimentally show that the proposed multi-modal feature achieves statistically significant performance gains over methods using individual visual and audio features alone and over alternative multi-modal fusion methods. Moreover, we find that average pooling is the most suitable strategy for bi-modal feature generation.
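The three steps in the abstract (bipartite graph, partitioning, pooling) can be illustrated with a small sketch. The abstract does not give the edge weights, the partitioning algorithm, or the exact pooling formula, so everything below is an assumption made for illustration: edge weights are taken to be co-occurrence counts across videos, the partitioning is a spectral co-clustering of the normalized weight matrix followed by a simple k-means, and average pooling is the mean over the histogram bins grouped into each bi-modal word. All function names are hypothetical.

```python
import numpy as np

def bipartite_weights(visual_bow, audio_bow):
    """Edge weights between visual and audio words.
    Assumption: weight = co-occurrence of word counts summed over videos."""
    return visual_bow.T @ audio_bow  # (n_visual_words, n_audio_words)

def kmeans(points, k, n_iter=50):
    """Plain k-means with deterministic farthest-point initialization."""
    centers = [points[0]]
    for _ in range(1, k):
        d = np.min([np.sum((points - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(d)])
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        dist = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return labels

def co_cluster(W, k):
    """Partition both sides of the bipartite graph into k groups.
    Assumption: spectral co-clustering on D1^{-1/2} W D2^{-1/2}."""
    eps = 1e-12
    d1 = W.sum(axis=1) + eps  # degrees of visual words
    d2 = W.sum(axis=0) + eps  # degrees of audio words
    Wn = W / np.sqrt(d1)[:, None] / np.sqrt(d2)[None, :]
    U, _, Vt = np.linalg.svd(Wn, full_matrices=False)
    n_vec = max(1, int(np.ceil(np.log2(k))))
    # Skip the leading (trivial) singular vector pair, rescale by degree.
    Zv = U[:, 1:1 + n_vec] / np.sqrt(d1)[:, None]
    Za = Vt.T[:, 1:1 + n_vec] / np.sqrt(d2)[:, None]
    labels = kmeans(np.vstack([Zv, Za]), k)
    return labels[:W.shape[0]], labels[W.shape[0]:]

def bimodal_bow(v_hist, a_hist, v_labels, a_labels, k):
    """Average pooling: each bi-modal word is the mean of its member bins."""
    feats = np.zeros(k)
    for j in range(k):
        members = np.concatenate([v_hist[v_labels == j], a_hist[a_labels == j]])
        if members.size:
            feats[j] = members.mean()
    return feats

# Toy data (made up): 4 videos, 4 visual words, 3 audio words.
visual_bow = np.array([[5, 4, 0, 1],
                       [4, 5, 1, 0],
                       [0, 1, 5, 4],
                       [1, 0, 4, 5]], dtype=float)
audio_bow = np.array([[6, 0, 1],
                      [5, 1, 0],
                      [0, 5, 4],
                      [1, 4, 5]], dtype=float)

W = bipartite_weights(visual_bow, audio_bow)
v_labels, a_labels = co_cluster(W, k=2)
feats = bimodal_bow(visual_bow[0], audio_bow[0], v_labels, a_labels, k=2)
print(v_labels, a_labels, feats)
```

In this toy setup the first two visual words and the first audio word tend to appear in the same videos, so they end up grouped into one bi-modal word. Swapping the mean in `bimodal_bow` for a maximum or a sum would give the other pooling variants one might compare against.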
