Experimenting with Motion Relativity for Action Recognition with a Large Number of Classes

In this paper, we present our approach and experiments for human action recognition on the UCF101 dataset. In our previous work [9], we proposed motion relativity for video event detection. In this work, we mainly evaluate the performance of the motion relativity feature for recognizing a specific set of events, i.e., human actions.

1. Feature Extraction

We submitted one run to the action recognition task, in which three features are used: SIFT, STIP, and ERMH-BoW (Expanded Relative Motion Histogram of Bag-of-Visual-Words) proposed in [9].

For SIFT descriptors, the Difference of Gaussian (DoG) [1] and Hessian-Affine [2] detectors are used to detect local interest points, and a 128-dimensional SIFT feature [1] is extracted to describe each local image patch. A visual vocabulary of 5000 visual words is generated by clustering the SIFT descriptors with the k-means algorithm. Given an image, each detected keypoint is then mapped to its three nearest visual words to form the BoW histogram (a brief illustrative sketch of this soft-assignment step is given at the end of this section).

For STIP descriptors, we directly make use of the features provided with the UCF101 dataset [4], where 4000 words are used to calculate the histogram for each video.

In our previous work [9, 10], ERMH-BoW was used for video event detection and achieved encouraging results. The motivation of the ERMH-BoW feature is to employ motion relativity to describe the behaviors of, and interactions between, different objects/scenes. Considering that object segmentation remains difficult in unconstrained videos, we employ visual words to capture what objects/scenes are present in the video (or event). The relative motion between different visual words is then computed to capture the interactions between objects/scenes. In the following, we first briefly recall the algorithm for extracting the ERMH-BoW feature and then employ it for human action recognition.

Given two visual words a and b, the relative motion histogram between them is calculated as
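To make the soft-assignment BoW step referenced above concrete, a minimal sketch is given below: it clusters SIFT descriptors into a visual vocabulary with k-means and maps each descriptor to its three nearest visual words. The vocabulary size and the three-nearest-word mapping follow the description above; the use of scikit-learn's KMeans, the equal-vote weighting, and the L1 normalisation are illustrative assumptions rather than the exact settings used in our run.

```python
# Minimal sketch of soft-assignment BoW over SIFT descriptors.
# Assumptions for illustration only: scikit-learn KMeans, equal votes for the
# k nearest words, L1-normalised histogram.
import numpy as np
from sklearn.cluster import KMeans


def build_vocabulary(descriptors, n_words=5000, seed=0):
    """Cluster 128-D SIFT descriptors with k-means to obtain visual words."""
    # For a full 5000-word vocabulary, MiniBatchKMeans would be the usual
    # faster alternative; plain KMeans is used here for clarity.
    kmeans = KMeans(n_clusters=n_words, random_state=seed, n_init=10)
    kmeans.fit(descriptors)
    return kmeans.cluster_centers_            # shape: (n_words, 128)


def bow_histogram(descriptors, vocabulary, k=3):
    """Soft-assignment BoW: each descriptor votes for its k nearest words."""
    n_words = vocabulary.shape[0]
    hist = np.zeros(n_words)
    # Pairwise Euclidean distances between descriptors and visual words.
    dists = np.linalg.norm(descriptors[:, None, :] - vocabulary[None, :, :], axis=2)
    nearest = np.argsort(dists, axis=1)[:, :k]  # indices of the k nearest words
    for idx in nearest.ravel():
        hist[idx] += 1.0
    return hist / max(hist.sum(), 1e-12)        # L1-normalised histogram


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    descs = rng.random((2000, 128))             # stand-in for SIFT descriptors
    vocab = build_vocabulary(descs, n_words=50) # small vocabulary for the demo
    print(bow_histogram(descs, vocab).shape)    # -> (50,)
```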