Experimenting with Motion Relativity for Action Recognition with a Large Number of Classes

In this paper, we present our approach and experiments for human action recognition on the UCF101 dataset. In our previous work [9], we proposed motion relativity for video event detection. In this work, we mainly evaluate the performance of the motion relativity feature for recognizing a specific set of events, i.e., human actions.

1. Feature Extraction

We submitted one run to the action recognition task, in which three features are used: SIFT, STIP, and ERMH-BoW (Expanded Relative Motion Histogram of Bag-of-Visual-Words) proposed in [9].

For SIFT descriptors, the Difference of Gaussian (DoG) [1] and Hessian-Affine [2] detectors are used to detect local interest points, and a 128-dimensional SIFT feature [1] is extracted to describe each local image patch. A visual vocabulary of 5000 visual words is generated by clustering the SIFT descriptors with the k-means algorithm. Given an image, each detected keypoint is then mapped to its three nearest visual words to form the BoW histogram (a brief illustrative sketch of this soft-assignment step is given at the end of this section).

For STIP descriptors, we directly make use of the features provided with the UCF101 dataset [4], where 4000 words are used to calculate the histogram for each video.

In our previous work [9, 10], ERMH-BoW was used for video event detection and achieved encouraging results. The motivation of the ERMH-BoW feature is to employ motion relativity to describe the behaviors of, and interactions between, different objects/scenes. Considering that object segmentation remains difficult in unconstrained videos, we employ visual words to capture what objects/scenes are present in the video (or event). The relative motion between different visual words is then computed to capture the interactions between objects/scenes. In the following, we first briefly recall the algorithm for extracting the ERMH-BoW feature and then employ it for human action recognition.

Given two visual words a and b, the relative motion histogram between them is calculated as
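To make the soft-assignment BoW step referenced above concrete, a minimal sketch is given below: it clusters SIFT descriptors into a visual vocabulary with k-means and maps each descriptor to its three nearest visual words. The vocabulary size and the three-nearest-word mapping follow the description above; the use of scikit-learn's KMeans, the equal-vote weighting, and the L1 normalisation are illustrative assumptions rather than the exact settings used in our run.

```python
# Minimal sketch of soft-assignment BoW over SIFT descriptors.
# Assumptions for illustration only: scikit-learn KMeans, equal votes for the
# k nearest words, L1-normalised histogram.
import numpy as np
from sklearn.cluster import KMeans


def build_vocabulary(descriptors, n_words=5000, seed=0):
    """Cluster 128-D SIFT descriptors with k-means to obtain visual words."""
    # For a full 5000-word vocabulary, MiniBatchKMeans would be the usual
    # faster alternative; plain KMeans is used here for clarity.
    kmeans = KMeans(n_clusters=n_words, random_state=seed, n_init=10)
    kmeans.fit(descriptors)
    return kmeans.cluster_centers_            # shape: (n_words, 128)


def bow_histogram(descriptors, vocabulary, k=3):
    """Soft-assignment BoW: each descriptor votes for its k nearest words."""
    n_words = vocabulary.shape[0]
    hist = np.zeros(n_words)
    # Pairwise Euclidean distances between descriptors and visual words.
    dists = np.linalg.norm(descriptors[:, None, :] - vocabulary[None, :, :], axis=2)
    nearest = np.argsort(dists, axis=1)[:, :k]  # indices of the k nearest words
    for idx in nearest.ravel():
        hist[idx] += 1.0
    return hist / max(hist.sum(), 1e-12)        # L1-normalised histogram


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    descs = rng.random((2000, 128))             # stand-in for SIFT descriptors
    vocab = build_vocabulary(descs, n_words=50) # small vocabulary for the demo
    print(bow_histogram(descs, vocab).shape)    # -> (50,)
```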