Action Recognition Using Hierarchical Independent Subspace Analysis with Trajectory

Action recognition in videos is an important and challenging problem in computer vision. One of the most crucial aspects of a successful action recognition system is its feature extraction component. Stacked, convolutional Independent Subspace Analysis (SC-ISA), has the best result among unsupervised learning algorithms for action recognition in Hollywood 2 (53.3%) and Youtube (75.8%). However, its performance still lags behind the current state-of-the-art, which uses computer vision-based feature engineering extraction techniques, by about 10%. In this paper, we improve SC-ISA’s results by incorporating motion information into SC-ISA. By extracting blocks following motion trajectories in videos, we are able to reduce noise and increase the number of training samples without degrading the network’s performance when training and testing SC-ISA. We increase SC-ISA’s result by about 1%.

[1]  Fei-Fei Li,et al.  Large-Scale Video Classification with Convolutional Neural Networks , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[2]  Antti Oulasvirta,et al.  Computer Vision – ECCV 2006 , 2006, Lecture Notes in Computer Science.

[3]  Shenghuo Zhu,et al.  Deep Learning of Invariant Features via Simulated Fixations in Video , 2012, NIPS.

[4]  Thomas Serre,et al.  HMDB: A large video database for human motion recognition , 2011, 2011 International Conference on Computer Vision.

[5]  Bill Triggs,et al.  Histograms of oriented gradients for human detection , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[6]  Yoshua Bengio,et al.  Gradient-based learning applied to document recognition , 1998, Proc. IEEE.

[7]  Leszek Wojnar,et al.  Image Analysis , 1998 .

[8]  Barbara Caputo,et al.  Recognizing human actions: a local SVM approach , 2004, Proceedings of the 17th International Conference on Pattern Recognition, 2004. ICPR 2004..

[9]  Pierre Comon,et al.  Independent component analysis, A new concept? , 1994, Signal Process..

[10]  Thomas Mensink,et al.  Improving the Fisher Kernel for Large-Scale Image Classification , 2010, ECCV.

[11]  Thomas Deselaers,et al.  ClassCut for Unsupervised Class Segmentation , 2010, ECCV.

[12]  Trevor Darrell,et al.  Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation , 2013, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[13]  Graeme Mitchison,et al.  Removing Time Variation with the Anti-Hebbian Differential Synapse , 1991, Neural Computation.

[14]  Cordelia Schmid,et al.  Action recognition by dense trajectories , 2011, CVPR 2011.

[15]  Cordelia Schmid,et al.  Action Recognition with Improved Trajectories , 2013, 2013 IEEE International Conference on Computer Vision.

[16]  Aapo Hyvärinen,et al.  Natural Image Statistics - A Probabilistic Approach to Early Computational Vision , 2009, Computational Imaging and Vision.

[17]  Cordelia Schmid,et al.  Human Detection Using Oriented Histograms of Flow and Appearance , 2006, ECCV.

[18]  Marc'Aurelio Ranzato,et al.  Building high-level features using large scale unsupervised learning , 2011, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[19]  Honglak Lee,et al.  Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations , 2009, ICML '09.

[20]  Rémi Ronfard,et al.  A survey of vision-based methods for action representation, segmentation and recognition , 2011, Comput. Vis. Image Underst..

[21]  Jean-François Cardoso,et al.  Multidimensional independent component analysis , 1998, Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP '98 (Cat. No.98CH36181).

[22]  Mubarak Shah,et al.  High-level event recognition in unconstrained videos , 2013, International Journal of Multimedia Information Retrieval.

[23]  Quoc V. Le,et al.  Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis , 2011, CVPR 2011.

[24]  Yann LeCun,et al.  Convolutional Learning of Spatio-temporal Features , 2010, ECCV.

[25]  Jiebo Luo,et al.  Recognizing realistic actions from videos “in the wild” , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[26]  Cordelia Schmid,et al.  Actions in context , 2009, CVPR.

[27]  Heng Wang LEAR-INRIA submission for the THUMOS workshop , 2013 .

[28]  Aapo Hyvärinen,et al.  Emergence of Phase- and Shift-Invariant Features by Decomposition of Natural Images into Independent Feature Subspaces , 2000, Neural Computation.

[29]  Gunnar Farnebäck,et al.  Two-Frame Motion Estimation Based on Polynomial Expansion , 2003, SCIA.

[30]  Xiang Zhang,et al.  OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks , 2013, ICLR.

[31]  Ronald Poppe,et al.  A survey on vision-based human action recognition , 2010, Image Vis. Comput..

[32]  Peter Földiák,et al.  Learning Invariance from Transformation Sequences , 1991, Neural Comput..

[33]  Geoffrey E. Hinton Connectionist Learning Procedures , 1989, Artif. Intell..

[34]  Cordelia Schmid,et al.  Aggregating local descriptors into a compact image representation , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[35]  Cordelia Schmid,et al.  Learning realistic human actions from movies , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[36]  Teuvo Kohonen,et al.  Emergence of invariant-feature detectors in the adaptive-subspace self-organizing map , 1996, Biological Cybernetics.

[37]  Mubarak Shah,et al.  UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild , 2012, ArXiv.

[38]  David Elliott,et al.  In the Wild , 2010 .