Learning Spatiotemporal Features with 3D Convolutional Networks

We propose a simple, yet effective approach for spatiotemporal feature learning using deep 3-dimensional convolutional networks (3D ConvNets) trained on a large scale supervised video dataset. Our findings are three-fold: 1) 3D ConvNets are more suitable for spatiotemporal feature learning compared to 2D ConvNets, 2) A homogeneous architecture with small 3x3x3 convolution kernels in all layers is among the best performing architectures for 3D ConvNets, and 3) Our learned features, namely C3D (Convolutional 3D), with a simple linear classifier outperform state-of-the-art methods on 4 different benchmarks and are comparable with current best methods on the other 2 benchmarks. In addition, the features are compact: achieving 52.8% accuracy on UCF101 dataset with only 10 dimensions and also very efficient to compute due to the fast inference of ConvNets. Finally, they are conceptually very simple and easy to train and use.

[1]  Yoshua Bengio,et al.  Convolutional networks for images, speech, and time series , 1998 .

[2]  Ivan Laptev,et al.  On Space-Time Interest Points , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[3]  Barbara Caputo,et al.  Recognizing human actions: a local SVM approach , 2004, ICPR 2004.

[4]  Ronen Basri,et al.  Actions as space-time shapes , 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[5]  Michal Irani,et al.  Detecting Irregularities in Images and in Video , 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[6]  Serge J. Belongie,et al.  Behavior recognition via sparse spatio-temporal features , 2005, 2005 IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance.

[7]  Bill Triggs,et al.  Histograms of oriented gradients for human detection , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[8]  Mubarak Shah,et al.  A 3-dimensional sift descriptor and its application to action recognition , 2007, ACM Multimedia.

[9]  Geoffrey E. Hinton,et al.  Visualizing Data using t-SNE , 2008 .

[10]  Cordelia Schmid,et al.  A Spatio-Temporal Descriptor Based on 3D-Gradients , 2008, BMVC.

[11]  Jason Weston,et al.  A unified architecture for natural language processing: deep neural networks with multitask learning , 2008, ICML '08.

[12]  Matthai Philipose,et al.  Egocentric recognition of handled objects: Benchmark and analysis , 2009, 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops.

[13]  Kristen Grauman,et al.  Observe locally, infer globally: A space-time MRF for detecting abnormal activities with incremental updates , 2009, CVPR.

[14]  Rama Chellappa,et al.  Moving vistas: Exploiting motion for describing scenes , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[15]  Yann LeCun,et al.  Convolutional Learning of Spatio-temporal Features , 2010, ECCV.

[16]  H. Sebastian Seung,et al.  Boundary Learning by Optimization with Topological Constraints , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[17]  Dong Yu,et al.  Investigation of full-sequence training of deep belief networks for speech recognition , 2010, INTERSPEECH.

[18]  Joseph F. Murray,et al.  Convolutional Networks Can Learn to Generate Affinity Graphs for Image Segmentation , 2010, Neural Computation.

[19]  Fei-Fei Li,et al.  Online detection of unusual events in videos via dynamic sparse coding , 2011, CVPR 2011.

[20]  Cordelia Schmid,et al.  Action recognition by dense trajectories , 2011, CVPR 2011.

[21]  Quoc V. Le,et al.  Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis , 2011, CVPR 2011.

[22]  Junsong Yuan,et al.  Sparse reconstruction cost for abnormal event detection , 2011, CVPR 2011.

[23]  Tal Hassner,et al.  One Shot Similarity Metric Learning for Action Recognition , 2011, SIMBAD.

[24]  Tara N. Sainath,et al.  Deep Belief Networks using discriminative features for phone recognition , 2011, 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[25]  Jitendra Malik,et al.  Large Displacement Optical Flow: Descriptor Matching in Variational Motion Estimation , 2011, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[26]  Martial Hebert,et al.  Activity Forecasting , 2012, ECCV.

[27]  Tal Hassner,et al.  The Action Similarity Labeling Challenge , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[28]  Richard P. Wildes,et al.  Dynamic scene understanding: The role of orientation features in space and time in scene classification , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[29]  Tal Hassner,et al.  Motion Interchange Patterns for Action Recognition in Unconstrained Videos , 2012, ECCV.

[30]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[31]  Cordelia Schmid,et al.  Dense Trajectories and Motion Boundary Descriptors for Action Recognition , 2013, International Journal of Computer Vision.

[32]  Jason J. Corso,et al.  Action bank: A high-level representation of activity in video , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[33]  Mubarak Shah,et al.  UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild , 2012, ArXiv.

[34]  Matthieu Cord,et al.  Dynamic Scene Classification: Learning Motion Descriptors with Slow Features Analysis , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[35]  Jeffrey Dean,et al.  Efficient Estimation of Word Representations in Vector Space , 2013, ICLR.

[36]  Lior Wolf,et al.  Evaluating New Variants of Motion Interchange Patterns , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition Workshops.

[37]  Ming Yang,et al.  3D Convolutional Neural Networks for Human Action Recognition , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[38]  Richard P. Wildes,et al.  Spacetime Forests with Complementary Features for Dynamic Scene Recognition , 2013, BMVC.

[39]  Cordelia Schmid,et al.  Action Recognition with Improved Trajectories , 2013, 2013 IEEE International Conference on Computer Vision.

[40]  Rob Fergus,et al.  Visualizing and Understanding Convolutional Networks , 2013, ECCV.

[41]  Trevor Darrell,et al.  Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation , 2013, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[42]  Jonathan Tompson,et al.  Learning Human Pose Estimation Features with Convolutional Networks , 2013, ICLR.

[43]  Andrew Zisserman,et al.  Two-Stream Convolutional Networks for Action Recognition in Videos , 2014, NIPS.

[44]  Trevor Darrell,et al.  Caffe: Convolutional Architecture for Fast Feature Embedding , 2014, ACM Multimedia.

[45]  Fei-Fei Li,et al.  Large-Scale Video Classification with Convolutional Neural Networks , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[46]  Vanja Josifovski,et al.  Up next: retrieval methods for large scale related video suggestion , 2014, KDD.

[47]  Trevor Darrell,et al.  PANDA: Pose Aligned Networks for Deep Attribute Modeling , 2013, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[48]  Bolei Zhou,et al.  Learning Deep Features for Scene Recognition using Places Database , 2014, NIPS.

[49]  Trevor Darrell,et al.  DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition , 2013, ICML.

[50]  Trevor Darrell,et al.  Recognizing Image Style , 2013, BMVC.

[51]  Yu Qiao,et al.  Large Margin Dimensionality Reduction for Action Similarity Labeling , 2014, IEEE Signal Processing Letters.

[52]  Richard P. Wildes,et al.  Bags of Spacetime Energies for Dynamic Scene Recognition , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[53]  Jonathan Tompson,et al.  MoDeep: A Deep Learning Framework Using Motion Features for Human Pose Estimation , 2014, ACCV.

[54]  Bhiksha Raj,et al.  Beyond Gaussian Pyramid: Multi-skip Feature Stacking for action recognition , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[55]  Matthew J. Hausknecht,et al.  Beyond short snippets: Deep networks for video classification , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[56]  Lisa Anne Hendricks,et al.  Long-term recurrent convolutional networks for visual recognition and description , 2015, CVPR.

[57]  Nitish Srivastava,et al.  Unsupervised Learning of Video Representations using LSTMs , 2015, ICML.

[58]  Michael S. Bernstein,et al.  ImageNet Large Scale Visual Recognition Challenge , 2014, International Journal of Computer Vision.

[59]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[60]  Trevor Darrell,et al.  Long-term recurrent convolutional networks for visual recognition and description , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[61]  Limin Wang,et al.  Computer Vision and Image Understanding Bag of Visual Words and Fusion Methods for Action Recognition: Comprehensive Study and Good Practice , 2022 .