Ordered trajectories for human action recognition with large number of classes

Recently, a video representation based on dense trajectories has been shown to outperform other human action recognition methods on several benchmark datasets. The trajectories capture the motion characteristics of different moving objects in the spatial and temporal dimensions. In dense trajectories, points are sampled at uniform intervals in space and time and then tracked using a dense optical flow field over a fixed length of L frames (optimally 15), with overlapping trajectories spread over the entire video. However, among these base (dense) trajectories, a few may continue for longer than the duration L, capturing motion characteristics of objects that may be more valuable than the information from the base trajectories alone. Thus, we propose a technique that searches for trajectories with a longer duration and refer to these as 'ordered trajectories'. Experimental results show that ordered trajectories perform much better than the base trajectories, both standalone and when combined. Moreover, the uniform sampling of dense trajectories does not discriminate objects of interest from the background or other objects. Consequently, much information is accumulated that may not actually be useful. This problem can escalate as the amount of data grows with an increasing number of action classes. We observe that our proposed trajectories also remove some background clutter. We use a Bag-of-Words framework to conduct experiments on the benchmark HMDB51, UCF50 and UCF101 datasets, which contain the largest number of action classes to date. Further, we evaluate three state-of-the-art feature encoding techniques to study their performance on a common platform.
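The core selection step described above, i.e. retaining only trajectories that continue for longer than the base duration L, can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes base trajectories are represented as `(start_frame, [(x, y), ...])` tuples and that a continuing trajectory is one whose start point coincides (within a tolerance) with another's end point in the next frame; the function name and tolerance are hypothetical.

```python
L = 15  # base trajectory length used by dense trajectories (optimal per the text)

def link_trajectories(base_trajs, tol=0.5):
    """Chain base trajectories whose end point matches another's start point,
    keeping only the resulting 'ordered trajectories' longer than L frames.

    base_trajs: list of (start_frame, [(x, y), ...]) tuples.
    """
    ordered = []
    used = set()
    for i, (f_i, pts_i) in enumerate(base_trajs):
        if i in used:
            continue
        chain = list(pts_i)
        frame, tail = f_i + len(pts_i), pts_i[-1]
        extended = True
        while extended:
            extended = False
            for j, (f_j, pts_j) in enumerate(base_trajs):
                if j in used or j == i:
                    continue
                # link if trajectory j starts in the next frame at (roughly)
                # the point where the current chain ended
                if (f_j == frame
                        and abs(pts_j[0][0] - tail[0]) <= tol
                        and abs(pts_j[0][1] - tail[1]) <= tol):
                    chain.extend(pts_j[1:])
                    frame, tail = f_j + len(pts_j), pts_j[-1]
                    used.add(j)
                    extended = True
                    break
        used.add(i)
        if len(chain) > L:  # keep only trajectories exceeding the base duration
            ordered.append((f_i, chain))
    return ordered
```

Because background points rarely stay trackable across several base-trajectory lifetimes, a length threshold of this kind also discards many short, clutter-related trajectories, which is consistent with the background-removal effect reported above.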
- A technique that captures information from objects with longer trajectory durations.
- A feature-selection-like approach that delivers better performance than several trajectory variants.
- Removal of a large number of trajectories related to background noise.
- We apply our technique to the HMDB51, UCF50 and UCF101 action datasets, containing the largest number of classes to date.
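The Bag-of-Words framework used in the experiments can be sketched as below. This is a generic hard-assignment illustration, not the paper's exact pipeline: each trajectory descriptor is assigned to its nearest codeword and a video is summarized as a normalized histogram of codeword counts; the codebook would normally be learned (e.g. by k-means) from training descriptors, and the function name here is hypothetical.

```python
import numpy as np

def bow_histogram(descriptors, codebook):
    """Hard-assignment Bag-of-Words histogram (L1-normalized).

    descriptors: (n, d) array of local trajectory descriptors.
    codebook:    (k, d) array of codewords (visual vocabulary).
    """
    # squared Euclidean distance from every descriptor to every codeword
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    words = d2.argmin(axis=1)  # index of nearest codeword per descriptor
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / hist.sum()   # normalize so the histogram sums to 1
```

The three encoding techniques evaluated in the paper (e.g. soft assignment or Fisher vectors) replace this hard-assignment step with richer statistics, but the overall video-level pooling structure is the same.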
