Informedia @ TRECVID2009: Analyzing Video Motions

The Informedia team participated in the high-level feature extraction and surveillance event detection tasks. This year we focused especially on analyzing motion in video. We developed a robust new descriptor, MoSIFT, which explicitly encodes appearance features together with motion information. For high-level feature detection, we trained multi-modality classifiers that combine traditional static features with MoSIFT. The experimental results show that MoSIFT performs well on motion-related concepts and is complementary to static features. For event detection, we trained event classifiers over sliding windows using a bag-of-video-words approach. To reduce the number of false alarms, we aggregated short positive windows into longer segments and applied a cascade classifier approach. Performance on the event detection task improved dramatically over last year.

1 MoSIFT

This section presents our MoSIFT [14] algorithm for detecting and describing spatio-temporal interest points. Part-based methods involve three major steps: detecting interest points, constructing a feature descriptor, and building a classifier. Interest point detection reduces the whole video from a volume of pixels to a compact but descriptive set of points; we therefore want a detector that finds enough interest points to carry the information needed to recognize a human action. The MoSIFT algorithm detects spatially distinctive interest points with substantial motion. We first apply the well-known SIFT algorithm to find visually distinctive components in the spatial domain, and then retain only those candidates that satisfy a (temporal) motion constraint: a 'sufficient' amount of optical flow around the distinctive point. Details of the algorithm are described in the following sections.
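The two-stage idea above (spatial distinctiveness first, then a motion-magnitude gate) can be sketched in a few lines. The snippet below is an illustrative toy, not the authors' implementation: it stands in a single difference-of-Gaussians extremum test for full SIFT detection, and a smoothed frame difference for true optical flow; the function name and both thresholds are our own assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def mosift_candidates(prev_frame, frame, dog_thresh=0.02, flow_thresh=0.05):
    """Toy MoSIFT-style interest point detector (illustrative sketch only).

    Stage 1 (spatial): keep local maxima of a difference-of-Gaussians
    response, a simplified stand-in for full SIFT keypoint detection.
    Stage 2 (temporal): keep only points whose smoothed frame-difference
    magnitude (a crude proxy for optical flow) exceeds flow_thresh.
    """
    f = frame.astype(np.float64) / 255.0
    p = prev_frame.astype(np.float64) / 255.0
    # Difference of Gaussians approximates the scale-space Laplacian.
    dog = gaussian_filter(f, 1.0) - gaussian_filter(f, 2.0)
    # Smoothed absolute frame difference as a motion-magnitude proxy.
    motion = np.abs(gaussian_filter(f - p, 1.5))
    pts = []
    h, w = f.shape
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            patch = dog[y - 1:y + 2, x - 1:x + 2]
            if (dog[y, x] == patch.max() and dog[y, x] > dog_thresh
                    and motion[y, x] > flow_thresh):
                pts.append((x, y))
    return pts
```

On a pair of frames containing one static and one moving bright spot, only the moving spot survives the motion gate, which is exactly the filtering behaviour the section describes: visually distinctive but stationary structure is discarded.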

[1] George Tzanetakis et al. MARSYAS: a framework for audio analysis, 1999, Organised Sound.

[2] P. Bartlett et al. Probabilities for SV Machines, 2000.

[3] Ivan Laptev et al. On Space-Time Interest Points, 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[4] Takeo Kanade et al. Object Detection Using the Statistics of Parts, 2004, International Journal of Computer Vision.

[5] Barbara Caputo et al. Recognizing human actions: a local SVM approach, 2004, Proceedings of the 17th International Conference on Pattern Recognition (ICPR 2004).

[6] Martial Hebert et al. Efficient visual event detection using volumetric features, 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05), Volume 1.

[7] Cees G. M. Snoek et al. Early versus late fusion in semantic video analysis, 2005, MULTIMEDIA '05.

[8] Serge J. Belongie et al. Behavior recognition via sparse spatio-temporal features, 2005, IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance.

[9] Cordelia Schmid et al. Local Features and Kernels for Classification of Texture and Object Categories: A Comprehensive Study, 2006, Conference on Computer Vision and Pattern Recognition Workshop (CVPRW'06).

[10] Juan Carlos Niebles et al. Unsupervised Learning of Human Action Categories Using Spatial-Temporal Words, 2006, BMVC.

[11] Hsuan-Tien Lin et al. A note on Platt's probabilistic outputs for support vector machines, 2007, Machine Learning.

[12] Cordelia Schmid et al. Learning realistic human actions from movies, 2008, IEEE Conference on Computer Vision and Pattern Recognition.

[13] Luc Van Gool et al. Action snippets: How many frames does human action recognition require?, 2008, IEEE Conference on Computer Vision and Pattern Recognition.

[14] Alexander G. Hauptmann et al. MoSIFT: Recognizing Human Actions in Surveillance Videos, 2009.

[15] Chih-Jen Lin et al. LIBSVM: A library for support vector machines, 2011, TIST.

[16] David G. Lowe. Distinctive Image Features from Scale-Invariant Keypoints, 2004, International Journal of Computer Vision.