Short-term audio-visual atoms for generic video concept classification

We investigate the challenging problem of joint audio-visual analysis of generic videos for semantic concept detection. We propose a novel representation, the Short-term Audio-Visual Atom (S-AVA), for improved concept detection. An S-AVA is defined as a short-term region track associated with regional visual features and background audio features. An effective algorithm, named Short-Term Region tracking with joint Point Tracking and Region Segmentation (STR-PTRS), is developed to extract S-AVAs from generic videos under challenging conditions such as uneven lighting, clutter, occlusions, and complicated motions of both objects and camera. Discriminative audio-visual codebooks are constructed on top of S-AVAs using Multiple Instance Learning, and codebook-based features are generated for semantic concept detection. We extensively evaluate our algorithm on Kodak's consumer benchmark video set, collected from real users. Experimental results confirm significant performance improvements: a gain of over 120% in mean average precision (MAP) compared to alternative approaches that use static region segmentation without temporal tracking. The joint audio-visual features also outperform visual features alone by an average of 8.5% (in terms of average precision, AP) over 21 concepts, with many concepts improving by more than 20%.
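
The results above are reported as AP per concept and MAP across concepts. As a rough illustration of how such numbers are typically computed, here is a minimal Python sketch. It assumes the common non-interpolated definition of AP (the mean of the precision values at the rank of each positive video); the abstract does not state which AP variant the authors used, and the function names and toy data are hypothetical.

```python
from typing import Dict, Sequence, Tuple


def average_precision(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Non-interpolated AP for one concept (assumed definition, not from the source).

    scores: detector confidence per video; labels: 1 if the concept is present, else 0.
    """
    # Rank videos from most to least confident.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank  # precision at this positive's rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(
    per_concept: Dict[str, Tuple[Sequence[float], Sequence[int]]]
) -> float:
    """MAP: the unweighted mean of per-concept AP values."""
    aps = [average_precision(s, l) for s, l in per_concept.values()]
    return sum(aps) / len(aps)


# Toy data for two hypothetical concepts.
toy = {
    "concept_a": ([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0]),
    "concept_b": ([0.9, 0.1], [0, 1]),
}
toy_map = mean_average_precision(toy)
```

In this toy case the positives of `concept_a` sit at ranks 1 and 3, so its AP is (1/1 + 2/3) / 2, and the single positive of `concept_b` sits at rank 2, giving an AP of 1/2.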
