Informedia@TRECVID 2011: Surveillance Event Detection

This paper presents a generic event detection system evaluated in the Surveillance Event Detection (SED) task of the TRECVID 2011 campaign. We investigate a generic statistical approach with spatio-temporal features, applied to the seven event classes defined by the SED task. The approach is based on a local spatio-temporal descriptor, MoSIFT, which is computed from pairs of video frames. Visual vocabularies are built from the cluster centers of MoSIFT features sampled from the video clips that contain events. We also estimate the spatial distribution of actions using over-generated person detections and background subtraction. Sliding-window sizes and step sizes are set per event according to a prior on each event's duration. Several sets of one-against-all action classifiers were trained using cascaded non-linear SVMs and Random Forests, which can improve classification performance on unbalanced data such as the SED datasets. Results for nine runs are presented, with variations in i) sliding-window size, ii) bag-of-words (BoW) step size, iii) classifier threshold, and iv) classifier. The results show an improvement over last year's performance on the event detection task.
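
To make the sliding-window bag-of-words step concrete, here is a minimal sketch in Python with NumPy. It is an illustration under stated assumptions, not the system's actual implementation: the function names (`assign_words`, `sliding_window_bow`), the nearest-center assignment, and the L1 normalization are hypothetical choices, and the descriptors stand in for MoSIFT features that would be extracted elsewhere.

```python
import numpy as np

def assign_words(descriptors, vocabulary):
    """Map each descriptor to the index of its nearest cluster center.

    descriptors: (n, d) array of local features (stand-ins for MoSIFT).
    vocabulary:  (k, d) array of cluster centers (the visual vocabulary).
    """
    # Squared Euclidean distance from every descriptor to every center: (n, k).
    d2 = ((descriptors[:, None, :] - vocabulary[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def sliding_window_bow(frame_idx, words, vocab_size, n_frames, window, step):
    """Build one normalized word histogram per temporal window.

    frame_idx: (n,) frame number at which each feature was detected.
    words:     (n,) visual-word index of each feature.
    window, step: window length and stride in frames; per the abstract
                  these would be set per event from a duration prior.
    """
    starts = np.arange(0, max(n_frames - window, 0) + 1, step)
    hists = np.zeros((len(starts), vocab_size))
    for i, s in enumerate(starts):
        in_window = (frame_idx >= s) & (frame_idx < s + window)
        h = np.bincount(words[in_window], minlength=vocab_size).astype(float)
        total = h.sum()
        if total > 0:
            h /= total  # L1 normalization (an assumption of this sketch)
        hists[i] = h
    return starts, hists

# Toy usage: a two-word vocabulary and four features in an 8-frame clip.
vocabulary = np.array([[0.0, 0.0], [10.0, 10.0]])
descriptors = np.array([[0.0, 1.0], [1.0, 0.0], [9.0, 9.0], [10.0, 11.0]])
frame_idx = np.array([0, 1, 5, 6])
words = assign_words(descriptors, vocabulary)
starts, hists = sliding_window_bow(frame_idx, words, vocab_size=2,
                                   n_frames=8, window=4, step=4)
```

Each row of `hists` would then be scored by the one-against-all classifiers, with a window counted as a detection when its score exceeds the classifier threshold.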
