Action-vectors: Unsupervised movement modeling for action recognition

Representation and modelling of movements play a significant role in recognising actions in unconstrained videos. However, explicit segmentation and labelling of movements are non-trivial because of the variability in actors, camera viewpoints, duration, etc. We therefore propose to train a Gaussian mixture model (GMM) with a large number of components, which we term a universal movement model (UMM). The UMM is trained on motion boundary histogram (MBH) features, which capture the motion trajectories associated with movements across all possible actions. For a particular action video, the MAP-adapted mean vectors of the UMM are concatenated to form a fixed-dimensional representation referred to as a “super movement vector” (SMV). Because the SMV is still high-dimensional, Baum-Welch statistics extracted from the UMM are used to arrive at a compact representation for each action video, which we refer to as an “action-vector”. We show that, even without the use of class labels, action-vectors provide a more discriminative representation of action classes: action-vectors based on MBH features yield an 8% relative improvement in classification accuracy over naïve MBH features on the UCF101 dataset. Furthermore, action-vectors projected with LDA achieve 93% accuracy on the UCF101 dataset, which rivals state-of-the-art deep learning techniques.
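The SMV step described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes a diagonal-covariance UMM whose parameters are already trained, and it uses the standard relevance-MAP mean update from adapted-GMM speaker verification. The function names, the relevance factor of 16, and the toy dimensions are all assumptions made for illustration. The zeroth- and first-order Baum-Welch statistics computed here are the quantities from which a compact action-vector would be estimated; that estimation is not shown.

```python
import numpy as np

def posteriors(X, weights, means, variances):
    """Component posteriors for each descriptor under a diagonal GMM."""
    diff = X[:, None, :] - means[None, :, :]                  # (T, C, D)
    log_gauss = -0.5 * (np.sum(diff ** 2 / variances, axis=2)
                        + np.sum(np.log(2.0 * np.pi * variances), axis=1))
    log_post = np.log(weights) + log_gauss                    # (T, C)
    log_post -= log_post.max(axis=1, keepdims=True)           # numerical stability
    post = np.exp(log_post)
    return post / post.sum(axis=1, keepdims=True)

def baum_welch_stats(X, weights, means, variances):
    """Zeroth-order (N) and first-order (F) statistics of one video."""
    post = posteriors(X, weights, means, variances)
    N = post.sum(axis=0)                                      # (C,)
    F = post.T @ X                                            # (C, D)
    return N, F

def super_movement_vector(X, weights, means, variances, relevance=16.0):
    """Concatenate MAP-adapted means; relevance=16 is an assumed value."""
    N, F = baum_welch_stats(X, weights, means, variances)
    alpha = (N / (N + relevance))[:, None]                    # adaptation weight
    data_means = F / np.maximum(N, 1e-10)[:, None]            # per-component data mean
    adapted = alpha * data_means + (1.0 - alpha) * means
    return adapted.reshape(-1)                                # length C * D

# Toy setup (hypothetical sizes): C components, D-dim descriptors, T descriptors.
rng = np.random.default_rng(0)
C, D, T = 4, 3, 50
weights = np.full(C, 1.0 / C)
means = rng.normal(size=(C, D))
variances = np.ones((C, D))
X = rng.normal(size=(T, D))        # stand-in for one video's MBH descriptors

N, F = baum_welch_stats(X, weights, means, variances)
smv = super_movement_vector(X, weights, means, variances)
```

Every video maps to a vector of length C × D regardless of how many descriptors it contains, which is what makes the representation fixed-dimensional. With a UMM of many components this length grows quickly, which is the motivation for the more compact action-vector.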
