Video event classification using string kernels

Event recognition is crucial for providing a high-level semantic description of video content. The bag-of-words (BoW) approach has proven successful for categorizing objects and scenes in images, but it cannot model temporal information between consecutive frames. In this paper we present a method that introduces temporal information for video event recognition within the BoW approach. Each event is modeled as a sequence of histograms of visual features, computed from each frame using the traditional BoW model. The sequences are treated as strings (phrases) in which each histogram is considered a character. These sequences, whose length varies with the duration of the video clip, are classified using SVM classifiers with a string kernel based on the Needleman-Wunsch edit distance. Experimental results on two domains, soccer videos and a subset of TRECVID 2005 news videos, demonstrate the validity of the proposed approach.
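The pipeline described above can be illustrated with a minimal sketch. It is not the implementation used in the paper: the chi-square cost between histograms, the constant gap cost, the exponential form of the kernel, and all function and parameter names (`chi2_distance`, `nw_edit_distance`, `string_kernel`, `gamma`, `gap_cost`) are assumptions made for illustration only.

```python
import math


def chi2_distance(h1, h2, eps=1e-12):
    """Chi-square distance between two BoW histograms.

    Assumption: used here as the substitution cost between two 'characters'.
    """
    return 0.5 * sum((a - b) ** 2 / (a + b + eps) for a, b in zip(h1, h2))


def nw_edit_distance(seq_a, seq_b, gap_cost=1.0):
    """Needleman-Wunsch style global alignment cost between two clips.

    Each clip is a list of per-frame histograms; the two clips may differ
    in length. gap_cost is a hypothetical insertion/deletion penalty.
    """
    n, m = len(seq_a), len(seq_b)
    # dp[i][j]: minimal cost of aligning the first i frames of seq_a
    # with the first j frames of seq_b.
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap_cost
    for j in range(1, m + 1):
        dp[0][j] = j * gap_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            substitute = dp[i - 1][j - 1] + chi2_distance(seq_a[i - 1], seq_b[j - 1])
            delete = dp[i - 1][j] + gap_cost
            insert = dp[i][j - 1] + gap_cost
            dp[i][j] = min(substitute, delete, insert)
    return dp[n][m]


def string_kernel(seq_a, seq_b, gamma=1.0, gap_cost=1.0):
    """Similarity derived from the edit distance (assumed exponential form)."""
    return math.exp(-gamma * nw_edit_distance(seq_a, seq_b, gap_cost))


def gram_matrix(clips, gamma=1.0, gap_cost=1.0):
    """Pairwise kernel values, suitable for an SVM with a precomputed kernel."""
    return [[string_kernel(a, b, gamma, gap_cost) for b in clips] for a in clips]


# Toy clips: two-bin histograms per frame, clips of different duration.
clip_a = [[1.0, 0.0], [0.0, 1.0]]
clip_b = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
K = gram_matrix([clip_a, clip_b])
```

A kernel built from an edit distance in this way is not guaranteed to be positive semidefinite, so the resulting Gram matrix should be checked, or the SVM solver chosen accordingly, before training.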
