Rethinking Spatiotemporal Feature Learning for Video Understanding

In this paper we study 3D convolutional networks for video understanding tasks. Our starting point is the state-of-the-art I3D model, which "inflates" all the 2D filters of the Inception architecture to 3D. We first consider "deflating" the I3D model at various levels to understand the role of 3D convolutions. Interestingly, we find that 3D convolutions at the top layers of the network contribute more than 3D convolutions at the bottom layers, while also being computationally more efficient. This indicates that I3D is better at capturing high-level temporal patterns than low-level motion signals. We also consider replacing 3D convolutions with spatiotemporal-separable 3D convolutions (i.e., replacing a convolution with a k × k × k filter by a 1 × k × k spatial convolution followed by a k × 1 × 1 temporal convolution); we show that such a model, which we call S3D, is 1.5× more computationally efficient (in terms of FLOPs) than I3D while achieving better accuracy. Finally, we explore spatiotemporal feature gating on top of S3D. The resulting model, which we call S3D-G, outperforms the state-of-the-art I3D model by 3.5% accuracy on Kinetics while reducing FLOPs by 34%. It also achieves new state-of-the-art performance when transferred to other action classification (UCF-101 and HMDB-51) and detection (UCF-101 and JHMDB) datasets.
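
To make the two building blocks concrete, the PyTorch sketch below factorizes a k × k × k 3D convolution into a 1 × k × k spatial convolution followed by a k × 1 × 1 temporal convolution, and adds a channel-wise self-gating block in the spirit of the feature gating used in S3D-G. This is a minimal sketch under our own assumptions: the module names (`SepConv3d`, `FeatureGating`), the BatchNorm/ReLU placement, and the exact gating formulation (sigmoid over spatiotemporally pooled features) are illustrative choices, not the paper's precise configuration.

```python
import torch
import torch.nn as nn

class SepConv3d(nn.Module):
    """Separable 3D convolution: a 1 x k x k spatial conv followed by
    a k x 1 x 1 temporal conv, in place of a full k x k x k conv."""
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.spatial = nn.Conv3d(in_ch, out_ch, kernel_size=(1, k, k),
                                 padding=(0, k // 2, k // 2), bias=False)
        self.temporal = nn.Conv3d(out_ch, out_ch, kernel_size=(k, 1, 1),
                                  padding=(k // 2, 0, 0), bias=False)
        self.bn = nn.BatchNorm3d(out_ch)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):  # x: (N, C, T, H, W)
        return self.relu(self.bn(self.temporal(self.spatial(x))))

class FeatureGating(nn.Module):
    """Channel-wise self-gating: pool over (T, H, W), map the pooled
    vector to per-channel weights in (0, 1), and rescale the features."""
    def __init__(self, ch):
        super().__init__()
        self.fc = nn.Linear(ch, ch)

    def forward(self, x):  # x: (N, C, T, H, W)
        w = x.mean(dim=(2, 3, 4))          # squeeze: (N, C)
        w = torch.sigmoid(self.fc(w))      # per-channel gate
        return x * w.view(*w.shape, 1, 1, 1)

# Hypothetical usage on a batch of two 16-frame RGB clips at 224 x 224:
clip = torch.randn(2, 3, 16, 224, 224)
block = nn.Sequential(SepConv3d(3, 64), FeatureGating(64))
print(block(clip).shape)                   # torch.Size([2, 64, 16, 224, 224])
```

The efficiency gain reported above falls out of the factorization: a full 3D kernel costs on the order of k³ weights (and multiply-adds) per input/output channel pair, whereas the separable pair costs roughly k² + k, so for k = 3 the convolution itself is substantially cheaper before any accuracy considerations.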
