Paper information - Short-term audio-visual atoms for generic video concept classification

Short-term audio-visual atoms for generic video concept classification

We investigate the challenging problem of joint audio-visual analysis of generic videos for semantic concept detection. We propose a novel representation, the Short-term Audio-Visual Atom (S-AVA), for improved concept detection. An S-AVA is defined as a short-term region track associated with regional visual features and background audio features. An effective algorithm, named Short-Term Region tracking with joint Point Tracking and Region Segmentation (STR-PTRS), is developed to extract S-AVAs from generic videos under challenging conditions such as uneven lighting, clutter, occlusions, and complicated motions of both objects and camera. Discriminative audio-visual codebooks are constructed on top of S-AVAs using Multiple Instance Learning, and codebook-based features are generated for semantic concept detection. We extensively evaluate our algorithm on Kodak's consumer benchmark video set, collected from real users. Experimental results confirm significant performance improvements: over 120% gain in mean average precision (MAP) compared to alternative approaches using static region segmentation without temporal tracking. The joint audio-visual features also outperform visual features alone by an average of 8.5% (in terms of average precision, AP) over 21 concepts, with many concepts gaining more than 20%.
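The abstract reports its results as AP and MAP gains. As a reading aid, the following is a minimal Python sketch of how these quantities are commonly computed. It assumes non-interpolated average precision over a ranked list of binary relevance labels and assumes that a "gain" is measured relative to the baseline; the abstract does not state which AP variant or which gain definition the authors used. The function names and the numbers in the usage lines are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def average_precision(ranked_labels: Sequence[int]) -> float:
    """Non-interpolated AP for one concept.

    ranked_labels holds 0/1 relevance labels, ordered by descending
    detector score (assumed variant; the paper may use another one).
    """
    hits = 0
    precision_sum = 0.0
    for rank, label in enumerate(ranked_labels, start=1):
        if label:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant item
    return precision_sum / hits if hits else 0.0


def mean_average_precision(per_concept: Sequence[Sequence[int]]) -> float:
    """MAP: the unweighted mean of AP over all concepts."""
    aps = [average_precision(labels) for labels in per_concept]
    return sum(aps) / len(aps)


def relative_gain(new: float, baseline: float) -> float:
    """Relative improvement of `new` over `baseline` (assumed definition)."""
    return (new - baseline) / baseline


# Hypothetical usage with made-up rankings and scores.
ap_example = average_precision([1, 0, 1, 0])
map_example = mean_average_precision([[1, 0], [0, 1]])
gain_example = relative_gain(0.22, 0.10)
```

Under this reading, a gain above 120% would mean the MAP more than doubled relative to the static-segmentation baseline.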
