Modeling temporal structure of complex actions using Bag-of-Sequencelets

We propose Bag-of-Sequencelets, a new model for recognizing complex actions. We represent a video as a sequence of primitive actions. We model a complex action as an ensemble of sub-sequences (sequencelets). We automatically learn sequencelets without any annotation of action structure. We achieve state-of-the-art results on the Olympic Sports and UCF YouTube datasets.

This paper proposes a new framework for modeling the temporal structure of complex human actions. Inspired by the fact that a complex action is a temporally ordered composition of sub-actions, we develop a new model named Bag-of-Sequencelets (BoS). To construct a BoS model, a video is first represented as a sequence of Primitive Actions (PAs). A PA is a representative motion pattern that constitutes actions, and it is learned in an unsupervised manner. Representing a video as a sequence of PAs preserves their temporal order. A sequencelet is an informative sub-sequence that describes a partial structure of an action while preserving the temporal relations among its PAs. In a BoS model, an action is modeled as an ensemble of sequencelets. Sequencelets are learned automatically by sequential pattern mining, without any annotation or prior knowledge of action structure. Because the BoS model has both compositional and chronological properties, it can effectively model the structure of complex actions despite intra-class variations such as viewpoint change. Experimental results show the effectiveness of the BoS model in temporal structure modeling. On the Olympic Sports and UCF YouTube datasets, BoS achieves higher classification accuracy than previous state-of-the-art methods.
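The pipeline described above (videos as PA sequences, mined sub-sequences, an ensemble of sequencelets) can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes each video has already been converted to a list of integer PA labels, it uses a brute-force frequent-subsequence search as a stand-in for the sequential pattern mining step, and it treats "informative" as "occurs in at least `min_support` videos". The names `is_subsequence`, `mine_sequencelets`, `bos_vector`, `min_support`, and `max_len` are hypothetical.

```python
from typing import List, Sequence, Tuple

Pattern = Tuple[int, ...]


def is_subsequence(pattern: Sequence[int], seq: Sequence[int]) -> bool:
    """Return True if `pattern` occurs in `seq` in the same order (gaps allowed)."""
    it = iter(seq)
    # Each `any` call consumes the shared iterator up to the next match,
    # so later pattern items can only match later positions in `seq`.
    return all(any(p == s for s in it) for p in pattern)


def mine_sequencelets(
    videos: List[List[int]], min_support: int, max_len: int = 3
) -> List[Pattern]:
    """Brute-force frequent-subsequence mining (assumed stand-in for the
    paper's sequential pattern mining step).

    A candidate is kept when it is an ordered subsequence of at least
    `min_support` videos. Candidates are grown one PA at a time from
    frequent prefixes, which is safe because a prefix is always at least
    as frequent as any extension of it.
    """
    alphabet = sorted({pa for video in videos for pa in video})
    frequent: List[Pattern] = []
    frontier: List[Pattern] = [()]
    for _ in range(max_len):
        next_frontier: List[Pattern] = []
        for prefix in frontier:
            for pa in alphabet:
                candidate = prefix + (pa,)
                support = sum(is_subsequence(candidate, v) for v in videos)
                if support >= min_support:
                    next_frontier.append(candidate)
        frequent.extend(next_frontier)
        frontier = next_frontier
    return frequent


def bos_vector(video: Sequence[int], sequencelets: List[Pattern]) -> List[int]:
    """Binary bag-of-sequencelets vector: one entry per mined sequencelet."""
    return [int(is_subsequence(s, video)) for s in sequencelets]


if __name__ == "__main__":
    # Toy PA sequences for three videos of one hypothetical action class.
    videos = [[0, 1, 2, 3], [0, 2, 1, 3], [0, 1, 3]]
    sequencelets = mine_sequencelets(videos, min_support=3)
    print(sequencelets)
    print(bos_vector([0, 1, 3], sequencelets))
    print(bos_vector([3, 1, 0], sequencelets))  # same PAs, reversed order
```

In the toy run, the reversed video `[3, 1, 0]` contains the same PAs as `[0, 1, 3]` but matches only the single-PA sequencelets, which illustrates how the representation keeps temporal order. The resulting vectors could then be passed to any standard classifier; that choice is left open here.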
