Representing Pairwise Spatial and Temporal Relations for Action Recognition

The popular bag-of-words paradigm for action recognition builds histograms of quantized features, typically at the cost of discarding all information about the relationships between them. Although including these relationships seems obviously beneficial, in practice it is difficult to find good representations for feature relationships in video. We propose a simple and computationally efficient method for expressing pairwise relationships between quantized features that combines the power of discriminative representations with key aspects of Naive Bayes. We demonstrate how our technique can augment both appearance- and motion-based features, and show that it significantly improves performance for both feature types.
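
As a rough illustration of the idea, the sketch below augments a standard bag-of-words histogram with counts of visual-word pairs, binned by a coarse quantization of each pair's relative spatio-temporal offset. This is a minimal sketch of the general approach under stated assumptions, not the paper's implementation; all names, the 2x2 before/after and left/right relation binning, and the use of raw pair counts are assumptions made for illustration.

```python
# Sketch: augmenting a bag-of-words histogram with pairwise
# spatio-temporal relation counts between quantized features.
import numpy as np

def bow_histogram(words, vocab_size):
    """Standard bag-of-words: count occurrences of each visual word."""
    hist = np.zeros(vocab_size)
    for w in words:
        hist[w] += 1
    return hist

def pairwise_relation_histogram(words, positions, times, vocab_size, n_rel=4):
    """Count co-occurrences of word pairs, binned by a coarse quantization
    of their relative spatio-temporal offset (before/after x left/right).
    The binning scheme here is an illustrative assumption."""
    hist = np.zeros((vocab_size, vocab_size, n_rel))
    for i in range(len(words)):
        for j in range(len(words)):
            if i == j:
                continue
            # Quantize the relation: 2 temporal bins x 2 spatial bins.
            t_bin = 0 if times[j] >= times[i] else 1
            s_bin = 0 if positions[j][0] >= positions[i][0] else 1
            hist[words[i], words[j], 2 * t_bin + s_bin] += 1
    return hist.ravel()

# Usage: concatenate the unary histogram with the (much larger) pairwise
# relation histogram to form the final feature vector.
words = np.array([3, 7, 3, 1])                              # quantized features
positions = np.array([[10, 5], [12, 9], [30, 4], [8, 8]])   # (x, y) per feature
times = np.array([0, 2, 2, 5])                              # frame index
feature = np.concatenate([bow_histogram(words, 16),
                          pairwise_relation_histogram(words, positions, times, 16)])
```

A concatenated representation like this can then be handed to any discriminative classifier; the quadratic pair loop is written for clarity rather than efficiency.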
