The Influence of Temporal Information on Human Action Recognition with Large Number of Classes

Human action recognition from video has attracted much interest over the last decade. In recent years, the trend has clearly shifted towards recognition in real-world, unconstrained (i.e., non-acted) conditions with an ever-growing number of action classes. Much of the work so far has used single frames, or sequences of frames in which each frame is treated individually. This paper investigates the contribution that temporal information can make to human action recognition when the number of action classes is large. The key contributions are: (i) we propose an information channel, complementary to the Bag-of-Words framework, that models the temporal occurrence of local information in videos; (ii) we identify the salient local information whose temporal occurrence is more discriminative than the local information itself. Experimental validation on the action recognition datasets with the largest number of classes to date shows the effectiveness of the proposed approach.
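To make the idea of a temporal channel complementary to Bag-of-Words concrete, the following is a minimal illustrative sketch (not the paper's actual method): alongside the standard appearance histogram over visual words, it records in which coarse temporal segment of the video each visual word occurs. All function and parameter names here are hypothetical.

```python
import numpy as np

def temporal_bow(word_ids, frame_ids, n_words, n_frames, n_bins=4):
    """Standard BoW histogram plus a simple temporal-occurrence channel.

    word_ids:  visual-word index of each local feature
    frame_ids: frame at which each local feature occurs
    Returns the concatenation of the L1-normalized appearance histogram
    and a per-word histogram over coarse temporal bins.
    """
    # Appearance channel: classic Bag-of-Words histogram.
    bow = np.bincount(word_ids, minlength=n_words).astype(float)
    bow /= max(bow.sum(), 1.0)

    # Temporal channel: where in the video each visual word occurs.
    t = np.asarray(frame_ids, dtype=float) / max(n_frames - 1, 1)  # normalize to [0, 1]
    bins = np.minimum((t * n_bins).astype(int), n_bins - 1)
    temporal = np.zeros((n_words, n_bins))
    np.add.at(temporal, (np.asarray(word_ids), bins), 1.0)
    temporal /= max(temporal.sum(), 1.0)

    return np.concatenate([bow, temporal.ravel()])
```

Because the temporal channel is a separate block of the final vector, it can be weighted or dropped independently, which is one way to probe how much timing information contributes on top of appearance alone.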
