SlowFast Networks for Video Recognition

We present SlowFast networks for video recognition. Our model involves (i) a Slow pathway, operating at low frame rate, to capture spatial semantics, and (ii) a Fast pathway, operating at high frame rate, to capture motion at fine temporal resolution. The Fast pathway can be made very lightweight by reducing its channel capacity, yet can learn useful temporal information for video recognition. Our models achieve strong performance for both action classification and detection in video, and large improvements are pin-pointed as contributions by our SlowFast concept. We report state-of-the-art accuracy on major video recognition benchmarks, Kinetics, Charades and AVA. Code has been made available at: https://github.com/facebookresearch/SlowFast.

[1]  P. Lennie,et al.  Spatial and temporal contrast sensitivities of neurones in lateral geniculate nucleus of macaque. , 1984, The Journal of physiology.

[2]  E H Adelson,et al.  Spatiotemporal energy models for the perception of motion. , 1985, Journal of the Optical Society of America. A, Optics and image science.

[3]  D. Hubel,et al.  Segregation of form, color, movement, and depth: anatomy, physiology, and perception. , 1988, Science.

[4]  D. J. Felleman,et al.  Distributed hierarchical processing in the primate cerebral cortex. , 1991, Cerebral cortex.

[5]  D. Ruderman The statistics of natural images , 1994 .

[6]  D. V. Essen,et al.  Neural mechanisms of form and motion processing in the primate visual system , 1994, Neuron.

[7]  David Mumford,et al.  Statistics of natural images and models , 1999, Proceedings. 1999 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Cat. No PR00149).

[8]  Edward H. Adelson,et al.  Motion illusions as optimal percepts , 2002, Nature Neuroscience.

[9]  Serge J. Belongie,et al.  Behavior recognition via sparse spatio-temporal features , 2005, 2005 IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance.

[10]  Cordelia Schmid,et al.  Human Detection Using Oriented Histograms of Flow and Appearance , 2006, ECCV.

[11]  Cordelia Schmid,et al.  Learning realistic human actions from movies , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[12]  Cordelia Schmid,et al.  A Spatio-Temporal Descriptor Based on 3D-Gradients , 2008, BMVC.

[13]  Li Fei-Fei,et al.  ImageNet: A large-scale hierarchical image database , 2009, CVPR.

[14]  Yann LeCun,et al.  Convolutional Learning of Spatio-temporal Features , 2010, ECCV.

[15]  Nitish Srivastava,et al.  Improving neural networks by preventing co-adaptation of feature detectors , 2012, ArXiv.

[16]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[17]  Cordelia Schmid,et al.  Action Recognition with Improved Trajectories , 2013, 2013 IEEE International Conference on Computer Vision.

[18]  Andrew Zisserman,et al.  Two-Stream Convolutional Networks for Action Recognition in Videos , 2014, NIPS.

[19]  Pietro Perona,et al.  Microsoft COCO: Common Objects in Context , 2014, ECCV.

[20]  Kaiming He,et al.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[21]  Sergey Ioffe,et al.  Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift , 2015, ICML.

[22]  Ross B. Girshick,et al.  Fast R-CNN , 2015, 1504.08083.

[23]  Lorenzo Torresani,et al.  Learning Spatiotemporal Features with 3D Convolutional Networks , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).

[24]  Jian Sun,et al.  Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[25]  Dumitru Erhan,et al.  Going deeper with convolutions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[26]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[27]  Ali Farhadi,et al.  Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding , 2016, ECCV.

[28]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[29]  Richard P. Wildes,et al.  Spatiotemporal Residual Networks for Video Action Recognition , 2016, NIPS.

[30]  Andrew Zisserman,et al.  Convolutional Two-Stream Network Fusion for Video Action Recognition , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[31]  Luc Van Gool,et al.  Temporal Segment Networks: Towards Good Practices for Deep Action Recognition , 2016, ECCV.

[32]  Tao Mei,et al.  Learning Spatio-Temporal Representation with Pseudo-3D Residual Networks , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[33]  Kaiming He,et al.  Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour , 2017, ArXiv.

[34]  Chen Sun,et al.  Rethinking Spatiotemporal Feature Learning For Video Understanding , 2017, ArXiv.

[35]  Fabio Viola,et al.  The Kinetics Human Action Video Dataset , 2017, ArXiv.

[36]  Frank Hutter,et al.  SGDR: Stochastic Gradient Descent with Warm Restarts , 2016, ICLR.

[37]  Andrew Zisserman,et al.  Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[38]  Kaiming He,et al.  Feature Pyramid Networks for Object Detection , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[39]  Ali Farhadi,et al.  Asynchronous Temporal Fields for Action Recognition , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[40]  Zhuowen Tu,et al.  Aggregated Residual Transformations for Deep Neural Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[41]  Fu Li,et al.  Exploiting Spatial-Temporal Modelling and Multi-Modal Fusion for Human Action Recognition , 2018, ArXiv.

[42]  Cordelia Schmid,et al.  Long-Term Temporal Convolutions for Action Recognition , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[43]  Gang Yu,et al.  Human Centric Spatio-Temporal Action Localization , 2018 .

[44]  Bernard Ghanem,et al.  The ActivityNet Large-Scale Activity Recognition Challenge 2018 Summary , 2018, ArXiv.

[45]  Cordelia Schmid,et al.  AVA: A Video Dataset of Spatio-Temporally Localized Atomic Visual Actions , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[46]  Andrew Zisserman,et al.  A Short Note about Kinetics-600 , 2018, ArXiv.

[47]  Cordelia Schmid,et al.  Actor-Centric Relation Network , 2018, ECCV.

[48]  Andrew Zisserman,et al.  A Better Baseline for AVA , 2018, ArXiv.

[49]  Luc Van Gool,et al.  Spatio-Temporal Channel Correlation Networks for Action Classification , 2018, ECCV.

[50]  Abhinav Gupta,et al.  Non-local Neural Networks , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[51]  Yann LeCun,et al.  A Closer Look at Spatiotemporal Convolutions for Action Recognition , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[52]  Bolei Zhou,et al.  Temporal Relational Reasoning in Videos , 2017, ECCV.

[53]  Alexander J. Smola,et al.  Compressed Video Action Recognition , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[54]  Thomas Brox,et al.  ECO: Efficient Convolutional Network for Online Video Understanding , 2018, ECCV.

[55]  Abhinav Gupta,et al.  Videos as Space-Time Region Graphs , 2018, ECCV.

[56]  Limin Wang,et al.  Appearance-and-Relation Networks for Video Classification , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[57]  Kaiming He,et al.  Long-Term Feature Banks for Detailed Video Understanding , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[58]  Andrew Zisserman,et al.  A Short Note on the Kinetics-700 Human Action Dataset , 2019, ArXiv.

[59]  Ross B. Girshick,et al.  Mask R-CNN , 2017, 1703.06870.