Localizing volumetric motion for action recognition in realistic videos

This paper presents a novel motion localization approach for recognizing actions and events in realistic videos, such as StandUp and Kiss in Hollywood movies. The challenge stems from the large visual and motion variations introduced by realistic action poses. Previous works mainly learn from descriptors of cuboids around space-time interest points (STIPs) to characterize actions. The size, shape, and space-time position of these cuboids are fixed without considering the underlying motion dynamics, which often yields a large set of fragmented cuboids that fail to capture the long-term dynamic properties of realistic actions. This paper proposes detecting spatio-temporal motion volumes (namely Volumes of Interest, VOIs) whose scale and position adapt to the motion so as to localize actions. First, motions are described as bags of point trajectories by tracking keypoints along the time dimension. VOIs are then adaptively extracted by clustering trajectories on the motion manifold. The resulting VOIs, of varying scales and centered at arbitrary positions depending on motion dynamics, are finally described by SIFT and 3D gradient features for action recognition. Compared with fixed-size cuboids, VOIs allow comprehensive modeling of long-term motion and show better capability in capturing contextual information associated with motion dynamics. Experiments on a realistic Hollywood movie dataset show that the proposed approach achieves a 20% relative improvement over a state-of-the-art STIP-based algorithm.
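The pipeline described above (track keypoints into trajectories, cluster the trajectories, and take each cluster's space-time extent as a VOI) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the trajectory feature (mean position plus mean velocity), the flat-kernel mean-shift clustering, and all function names are simplifying assumptions.

```python
# Illustrative sketch of trajectory clustering into Volumes of Interest.
# Feature design, bandwidth, and the flat-kernel mean shift are assumptions.
import math
from collections import defaultdict

def traj_feature(traj):
    """Summarize a trajectory of (x, y, t) points by its mean
    position and average per-frame velocity (an assumed feature)."""
    xs = [p[0] for p in traj]
    ys = [p[1] for p in traj]
    n = len(traj)
    mx, my = sum(xs) / n, sum(ys) / n
    vx = (xs[-1] - xs[0]) / (n - 1)
    vy = (ys[-1] - ys[0]) / (n - 1)
    return (mx, my, vx, vy)

def mean_shift(points, bandwidth=10.0, iters=20):
    """Flat-kernel mean-shift mode seeking; returns one converged
    mode per input point (points sharing a mode form a cluster)."""
    modes = []
    for p in points:
        m = p
        for _ in range(iters):
            nbrs = [q for q in points if math.dist(m, q) <= bandwidth]
            m = tuple(sum(c) / len(nbrs) for c in zip(*nbrs))
        modes.append(tuple(round(c, 1) for c in m))
    return modes

def extract_vois(trajectories, bandwidth=10.0):
    """Cluster trajectories by their modes and return one space-time
    bounding volume ((x,y,t) min corner, (x,y,t) max corner) per cluster."""
    feats = [traj_feature(t) for t in trajectories]
    modes = mean_shift(feats, bandwidth)
    clusters = defaultdict(list)
    for mode, traj in zip(modes, trajectories):
        clusters[mode].append(traj)
    vois = []
    for trajs in clusters.values():
        pts = [p for t in trajs for p in t]
        xs, ys, ts = zip(*pts)
        vois.append(((min(xs), min(ys), min(ts)),
                     (max(xs), max(ys), max(ts))))
    return vois
```

Because each VOI is the bounding volume of a whole trajectory cluster, its scale and position follow the motion rather than a fixed cuboid grid, which is the property the paper exploits.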
