Action recognition with multi-scale trajectory-pooled 3D convolutional descriptors

Hand-crafted and learning-based features are two main types of video representations in the field of video understanding. How to integrate their merits to design good descriptors has been the research hotspot recently. Motivated by TDD (Wang et al. 2015), we combine trajectory pooling method and 3D ConvNets (Tran et al. 2015) and put forward a novel multi-scale trajectory-pooled 3D convolutional descriptor (MTC3D) for action recognition in this paper. Specifically, we calculate multi-scale dense trajectories from the input video and perform trajectory pooling on feature maps of 3D CNN. The proposed descriptor has two advantages: 3D CNN has the ability to extract high-level semantic information from videos and multi-scale trajectory pooling method utilizes the temporal information of videos subtly. The experiments on the datasets of HMDB51 and UCF101 demonstrate that the proposed descriptor achieves state-of-the-art results.

[1]  Thomas Serre,et al.  A Biologically Inspired System for Action Recognition , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[2]  Thomas Serre,et al.  HMDB: A large video database for human motion recognition , 2011, 2011 International Conference on Computer Vision.

[3]  Shengping Zhang,et al.  Trajectory-Pooled 3D Convolutional Descriptors for Action Recognition , 2017, PCM.

[4]  Giorgio Metta,et al.  Keep it simple and sparse: real-time action recognition , 2013, J. Mach. Learn. Res..

[5]  Yiannis Demiris,et al.  Hierarchical attentive multiple models for execution and recognition of actions , 2006, Robotics Auton. Syst..

[6]  Cordelia Schmid,et al.  Action recognition by dense trajectories , 2011, CVPR 2011.

[7]  Ruslan Salakhutdinov,et al.  Action Recognition using Visual Attention , 2015, NIPS 2015.

[8]  Navdeep Jaitly,et al.  Towards End-To-End Speech Recognition with Recurrent Neural Networks , 2014, ICML.

[9]  Luc Van Gool,et al.  Temporal Segment Networks: Towards Good Practices for Deep Action Recognition , 2016, ECCV.

[10]  Cordelia Schmid,et al.  Action Recognition with Improved Trajectories , 2013, 2013 IEEE International Conference on Computer Vision.

[11]  Matthijs C. Dorst Distinctive Image Features from Scale-Invariant Keypoints , 2011 .

[12]  Trevor Darrell,et al.  Long-term recurrent convolutional networks for visual recognition and description , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[13]  Robert C. Bolles,et al.  Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography , 1981, CACM.

[14]  Nitish Srivastava,et al.  Unsupervised Learning of Video Representations using LSTMs , 2015, ICML.

[15]  Michal Irani,et al.  Detecting Irregularities in Images and in Video , 2005, ICCV.

[16]  Luc Van Gool,et al.  SURF: Speeded Up Robust Features , 2006, ECCV.

[17]  Cordelia Schmid,et al.  Human Detection Using Oriented Histograms of Flow and Appearance , 2006, ECCV.

[18]  Ronald Poppe,et al.  A survey on vision-based human action recognition , 2010, Image Vis. Comput..

[19]  Quoc V. Le,et al.  Sequence to Sequence Learning with Neural Networks , 2014, NIPS.

[20]  Quoc V. Le,et al.  Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis , 2011, CVPR 2011.

[21]  Lorenzo Torresani,et al.  Learning Spatiotemporal Features with 3D Convolutional Networks , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).

[22]  Luc Van Gool,et al.  Deep Temporal Linear Encoding Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[23]  Hongxun Yao,et al.  Strategy for dynamic 3D depth data matching towards robust action retrieval , 2015, Neurocomputing.

[24]  Cordelia Schmid,et al.  A Spatio-Temporal Descriptor Based on 3D-Gradients , 2008, BMVC.

[25]  Mubarak Shah,et al.  UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild , 2012, ArXiv.

[26]  Wenhui Li,et al.  Cross-view action recognition by cross-domain learning , 2016, Image Vis. Comput..

[27]  Marcel Worring,et al.  Concept-Based Video Retrieval , 2009, Found. Trends Inf. Retr..

[28]  Ming Yang,et al.  3D Convolutional Neural Networks for Human Action Recognition , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[29]  Fei-Fei Li,et al.  Large-Scale Video Classification with Convolutional Neural Networks , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[30]  Christopher G. Harris,et al.  A Combined Corner and Edge Detector , 1988, Alvey Vision Conference.

[31]  Yue Gao,et al.  Continuous Probability Distribution Prediction of Image Emotions via Multitask Shared Sparse Regression , 2017, IEEE Transactions on Multimedia.

[32]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[33]  Thomas Mensink,et al.  Image Classification with the Fisher Vector: Theory and Practice , 2013, International Journal of Computer Vision.

[34]  J.K. Aggarwal,et al.  Human activity analysis , 2011, ACM Comput. Surv..

[35]  Mohan S. Kankanhalli,et al.  Hierarchical Clustering Multi-Task Learning for Joint Human Action Grouping and Recognition , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[36]  Andrew Zisserman,et al.  Two-Stream Convolutional Networks for Action Recognition in Videos , 2014, NIPS.

[37]  Yue Gao,et al.  Predicting Personalized Emotion Perceptions of Social Images , 2016, ACM Multimedia.

[38]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[39]  Mubarak Shah,et al.  A 3-dimensional sift descriptor and its application to action recognition , 2007, ACM Multimedia.

[40]  Anupam Agrawal,et al.  Vision based hand gesture recognition for human computer interaction: a survey , 2012, Artificial Intelligence Review.

[41]  Bill Triggs,et al.  Histograms of oriented gradients for human detection , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[42]  Xiangyu Wang,et al.  Logo information recognition in large-scale social media data , 2014, Multimedia Systems.

[43]  Cordelia Schmid,et al.  Learning realistic human actions from movies , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[44]  Matthew J. Hausknecht,et al.  Beyond short snippets: Deep networks for video classification , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[45]  IEEE conference on computer vision and pattern recognition , 2000, Proceedings IEEE Conference on Computer Vision and Pattern Recognition. CVPR 2000 (Cat. No.PR00662).

[46]  Yun Fu,et al.  Sparse Coding on Local Spatial-Temporal Volumes for Human Action Recognition , 2010, ACCV.

[47]  Limin Wang,et al.  Action recognition with trajectory-pooled deep-convolutional descriptors , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[48]  Pietro Perona,et al.  A Bayesian hierarchical model for learning natural scene categories , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).