Joint audio-visual bi-modal codewords for video event detection

Joint audio-visual patterns often exist in videos and provide strong multi-modal cues for detecting multimedia events. However, conventional methods generally fuse the visual and audio information only at a superficial level, without adequately exploring the deep intrinsic joint patterns. In this paper, we propose a joint audio-visual bi-modal representation, called bi-modal words. We first build a bipartite graph to model the relations between the quantized words extracted from the visual and audio modalities. We then partition the bipartite graph to construct the bi-modal words, which reveal the joint patterns across modalities. Finally, different pooling strategies are employed to re-quantize the visual and audio words into bi-modal words, forming bi-modal Bag-of-Words representations that are fed to subsequent multimedia event classifiers. We show experimentally that the proposed multi-modal feature achieves statistically significant performance gains over methods using the individual visual or audio features alone, as well as over alternative multi-modal fusion methods. Moreover, we find that average pooling is the most suitable strategy for bi-modal feature generation.
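
The pipeline described above lends itself to a short sketch. The Python code below is a minimal illustration, not the paper's exact formulation: the codebook sizes are hypothetical, scikit-learn's SpectralCoclustering (an implementation of bipartite spectral graph partitioning) stands in for the partitioning step, and the co-occurrence edge weighting and function names are illustrative assumptions.

```python
# Minimal sketch of the bi-modal word pipeline, under the assumptions
# stated above. Codebook sizes are hypothetical.
import numpy as np
from sklearn.cluster import SpectralCoclustering

N_VISUAL, N_AUDIO, N_BIMODAL = 1000, 500, 100  # hypothetical sizes


def build_cooccurrence(visual_bows, audio_bows):
    """Edge weights of the bipartite graph: how strongly visual word i
    and audio word j co-occur across the training videos.

    visual_bows: (n_videos, N_VISUAL), audio_bows: (n_videos, N_AUDIO).
    """
    return visual_bows.T @ audio_bows  # shape (N_VISUAL, N_AUDIO)


def fit_bimodal_words(visual_bows, audio_bows, n_bimodal=N_BIMODAL):
    """Partition the bipartite graph so that each visual and audio word
    is assigned to one of n_bimodal bi-modal words."""
    W = build_cooccurrence(visual_bows, audio_bows)
    model = SpectralCoclustering(n_clusters=n_bimodal, random_state=0)
    model.fit(W + 1e-8)  # small offset avoids empty rows/columns
    # row_labels_: bi-modal word of each visual word;
    # column_labels_: bi-modal word of each audio word.
    return model.row_labels_, model.column_labels_


def bimodal_bow(visual_bow, audio_bow, vis_labels, aud_labels,
                n_bimodal=N_BIMODAL):
    """Re-quantize one video's visual and audio BoW histograms into the
    bi-modal vocabulary, using average pooling over each bi-modal
    word's member words."""
    feat = np.zeros(n_bimodal)
    for k in range(n_bimodal):
        members = np.concatenate([visual_bow[vis_labels == k],
                                  audio_bow[aud_labels == k]])
        if members.size:
            feat[k] = members.mean()  # average pooling
    return feat
```

The re-quantization step uses average pooling, matching the strategy the paper finds most effective; max or sum pooling could be substituted by changing the `feat[k]` assignment.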
