Video Event Recognition Using Kernel Methods with Multilevel Temporal Alignment

In this work, we systematically study the problem of event recognition in unconstrained news video sequences. We adopt a discriminative kernel-based method, for which the measure of similarity between video clips plays an important role. First, we represent a video clip as a bag of orderless descriptors extracted from all of its constituent frames and apply the earth mover's distance (EMD) to integrate similarities among frames from two clips. Observing that a video clip is usually composed of multiple subclips corresponding to the evolution of an event over time, we further build a multilevel temporal pyramid. At each pyramid level, we integrate the information from different subclips using integer-value-constrained EMD to explicitly align the subclips. By fusing the information from the different pyramid levels, we develop temporally aligned pyramid matching (TAPM) for measuring video similarity. We conduct comprehensive experiments on the TRECVID 2005 corpus, which contains more than 6,800 clips. Our experiments demonstrate that (1) the multilevel TAPM method clearly outperforms single-level EMD (SLEMD) and (2) SLEMD outperforms keyframe-based and multiframe-based detection methods by a large margin. In addition, we conduct an in-depth investigation of various aspects of the proposed techniques, such as weight selection in SLEMD, sensitivity to temporal clustering, the effect of temporal alignment, and possible approaches for speedup. Extensive analysis of the results also reveals an intuitive interpretation of video event recognition through video subclip alignment at different levels.
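The subclip alignment and level fusion described above can be illustrated with a minimal sketch. It assumes that each clip has the same number of subclips at a given pyramid level and that a matrix of subclip-to-subclip distances (for example, EMD values) has already been computed. Under those assumptions, EMD with unit weights and integer-valued flows reduces to a one-to-one matching, which is found here by brute force because the matrices are tiny. The function names `aligned_distance` and `fused_distance`, the example matrices, and the weighted-average fusion rule are hypothetical choices made for illustration; they are not taken from the paper.

```python
from itertools import permutations


def aligned_distance(dist):
    """Mean cost of the cheapest one-to-one matching between subclips.

    dist[i][j] is a precomputed distance between subclip i of the first
    clip and subclip j of the second clip. The matrix is assumed square,
    i.e. both clips have the same number of subclips at this level.
    """
    n = len(dist)
    if any(len(row) != n for row in dist):
        raise ValueError("dist must be a square matrix")
    # Brute force over all matchings; fine for the 1, 2, or 4 subclips
    # of a shallow pyramid, too slow for large n.
    best = min(
        sum(dist[i][perm[i]] for i in range(n))
        for perm in permutations(range(n))
    )
    return best / n


def fused_distance(level_dists, weights):
    """Weighted average of the aligned distances of all pyramid levels.

    The weighting scheme is an assumption made for this sketch.
    """
    if len(level_dists) != len(weights):
        raise ValueError("need exactly one weight per level")
    total = sum(weights)
    return sum(
        w * aligned_distance(d) for w, d in zip(weights, level_dists)
    ) / total


if __name__ == "__main__":
    # Level 0: whole clip against whole clip (a single distance).
    level0 = [[0.8]]
    # Level 1: two subclips per clip; the cheapest matching is the
    # crossed one (0 -> 1, 1 -> 0), i.e. the events occur in swapped order.
    level1 = [[0.9, 0.2],
              [0.3, 0.7]]
    print(aligned_distance(level1))
    print(fused_distance([level0, level1], [1.0, 1.0]))
```

For larger numbers of subclips, a polynomial-time assignment solver such as the Hungarian algorithm would replace the brute-force search.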
