Two-layer discriminative model for human activity recognition

Most recent methods for action/activity recognition, usually based on static classifiers, have achieved improvements by integrating the context of local interest point (IP) features, such as spatiotemporal IPs, by characterising their neighbourhood at different scales. In this study, the authors propose a new approach that explicitly models the sequential aspect of activities. First, a sliding-window segmentation technique splits the video stream into short overlapping segments. Each window is characterised by a local bag of words of IPs encoded by motion information. A first-layer support vector machine (SVM) provides, for each window, a vector of conditional class probabilities that summarises all the discriminant information relevant for sequence recognition. The sequence of these stochastic vectors is then fed to a hidden conditional random field (HCRF) for inference at the sequence level. The authors also show how their approach can be naturally extended to the joint segmentation and recognition of a sequence of action classes within a continuous video stream. They have tested their model on various human action and activity datasets, and the results obtained compare favourably with the current state of the art.
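The first two steps of the pipeline, sliding-window segmentation and the per-window local bag of words, can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the half-open frame ranges, the handling of the video tail, and the L1 normalisation are assumptions made for illustration, and each interest point is assumed to have already been quantised to a visual-word index. The SVM and HCRF layers are left out.

```python
from typing import List, Sequence, Tuple


def sliding_windows(num_frames: int, window: int, step: int) -> List[Tuple[int, int]]:
    """Return half-open (start, end) frame ranges; windows overlap when step < window."""
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    if num_frames <= window:
        return [(0, num_frames)]
    starts = list(range(0, num_frames - window + 1, step))
    # Assumption: add one last window so the tail of the video is always covered.
    if starts[-1] + window < num_frames:
        starts.append(num_frames - window)
    return [(s, s + window) for s in starts]


def window_histograms(
    ip_frames: Sequence[int],
    ip_words: Sequence[int],
    num_frames: int,
    vocab_size: int,
    window: int,
    step: int,
) -> List[List[float]]:
    """Build one L1-normalised visual-word histogram per window.

    ip_frames[i] is the frame index of interest point i and ip_words[i] is its
    (hypothetical) pre-computed visual-word index in [0, vocab_size).
    """
    histograms = []
    for start, end in sliding_windows(num_frames, window, step):
        counts = [0.0] * vocab_size
        for frame, word in zip(ip_frames, ip_words):
            if start <= frame < end:
                counts[word] += 1.0
        total = sum(counts)
        if total > 0:
            counts = [c / total for c in counts]
        histograms.append(counts)
    return histograms
```

In a full system, each histogram would be passed to a probabilistic classifier to obtain the per-window class-probability vector, and the resulting sequence of vectors would form the observation sequence for the sequence-level model.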
