Sparse Temporal Causal Convolution for Efficient Action Modeling

Recently, spatio-temporal convolutional networks have achieved prominent performance in action classification. However, debates on the importance of temporal information lead to the rethinking of these architectures. In this work, we propose to employ sparse temporal convolutional operations in networks for efficient action modeling. We demonstrate that the explicit temporal feature interactions can be largely reduced without any degradation. And towards better scalability, we use causal convolutions for temporal feature learning. Under causality constraints, we replenish the model with auxiliary self-supervised tasks, namely video prediction and frame order discrimination. Besides, a gradient based multi-task learning algorithm is introduced for guaranteeing the dominance of action recognition task. The proposed model matches or outperforms the state-of-the-art methods on Kinetics, Something-Something V2, UCF101 and HMDB51 datasets.

[1]  Roberto Cipolla,et al.  Multi-task Learning Using Uncertainty to Weigh Losses for Scene Geometry and Semantics , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[2]  Lorenzo Torresani,et al.  Learning Spatiotemporal Features with 3D Convolutional Networks , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).

[3]  Fei-Fei Li,et al.  Large-Scale Video Classification with Convolutional Neural Networks , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[4]  Gabriel Kreiman,et al.  Deep Predictive Coding Networks for Video Prediction and Unsupervised Learning , 2016, ICLR.

[5]  Ivan Laptev,et al.  On Space-Time Interest Points , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[6]  Mubarak Shah,et al.  UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild , 2012, ArXiv.

[7]  Yann LeCun,et al.  A Closer Look at Spatiotemporal Convolutions for Action Recognition , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[8]  Cordelia Schmid,et al.  Human Detection Using Oriented Histograms of Flow and Appearance , 2006, ECCV.

[9]  Zhao Chen,et al.  GradNorm: Gradient Normalization for Adaptive Loss Balancing in Deep Multitask Networks , 2017, ICML.

[10]  Li Fei-Fei,et al.  Dynamic Task Prioritization for Multitask Learning , 2018, ECCV.

[11]  Heiga Zen,et al.  WaveNet: A Generative Model for Raw Audio , 2016, SSW.

[12]  Richard P. Wildes,et al.  Spatiotemporal Multiplier Networks for Video Action Recognition , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[13]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[14]  Kilian Q. Weinberger,et al.  Memory-Efficient Implementation of DenseNets , 2017, ArXiv.

[15]  Christopher Joseph Pal,et al.  Delving Deeper into Convolutional Networks for Learning Video Representations , 2015, ICLR.

[16]  Xi Wang,et al.  Multi-Stream Multi-Class Fusion of Deep Networks for Video Classification , 2016, ACM Multimedia.

[17]  Tianqi Chen,et al.  Training Deep Nets with Sublinear Memory Cost , 2016, ArXiv.

[18]  Matthew J. Hausknecht,et al.  Beyond short snippets: Deep networks for video classification , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[19]  Thomas Serre,et al.  HMDB: A large video database for human motion recognition , 2011, 2011 International Conference on Computer Vision.

[20]  Andrew Zisserman,et al.  Learning and Using the Arrow of Time , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[21]  Andrew Zisserman,et al.  Massively Parallel Video Networks , 2018, ECCV.

[22]  Chen Sun,et al.  Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification , 2017, ECCV.

[23]  Ming Yang,et al.  3D Convolutional Neural Networks for Human Action Recognition , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[24]  Bernhard Schölkopf,et al.  Seeing the Arrow of Time , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[25]  Andrew Zisserman,et al.  Two-Stream Convolutional Networks for Action Recognition in Videos , 2014, NIPS.

[26]  Yann LeCun,et al.  Deep multi-scale video prediction beyond mean square error , 2015, ICLR.

[27]  Mubarak Shah,et al.  A 3-dimensional sift descriptor and its application to action recognition , 2007, ACM Multimedia.

[28]  Luca Antiga,et al.  Automatic differentiation in PyTorch , 2017 .

[29]  Rich Caruana,et al.  Multitask Learning , 1998, Encyclopedia of Machine Learning and Data Mining.

[30]  Michael J. Black,et al.  Secrets of optical flow estimation and their principles , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[31]  Andrew Zisserman,et al.  Multi-task Self-Supervised Visual Learning , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[32]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[33]  Trevor Darrell,et al.  Long-term recurrent convolutional networks for visual recognition and description , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[34]  Jitendra Malik,et al.  Learning to See by Moving , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[35]  Yann Dauphin,et al.  Convolutional Sequence to Sequence Learning , 2017, ICML.

[36]  Richard P. Wildes,et al.  Spatiotemporal Residual Networks for Video Action Recognition , 2016, NIPS.

[37]  Susanne Westphal,et al.  The “Something Something” Video Database for Learning and Evaluating Visual Common Sense , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[38]  Leonidas J. Guibas,et al.  Taskonomy: Disentangling Task Transfer Learning , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[39]  Yi Li,et al.  Deformable Convolutional Networks , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[40]  Andrew Zisserman,et al.  Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[41]  Andrew Zisserman,et al.  Spatial Transformer Networks , 2015, NIPS.

[42]  Changsheng Xu,et al.  Watch, Think and Attend: End-to-End Video Classification via Dynamic Knowledge Evolution Modeling , 2018, ACM Multimedia.

[43]  Mohan S. Kankanhalli,et al.  Attention Transfer from Web Images for Video Recognition , 2017, ACM Multimedia.

[44]  Nitish Srivastava,et al.  Unsupervised Learning of Video Representations using LSTMs , 2015, ICML.

[45]  Abhinav Gupta,et al.  Non-local Neural Networks , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[46]  Luc Van Gool,et al.  Temporal Segment Networks: Towards Good Practices for Deep Action Recognition , 2016, ECCV.

[47]  Xi Wang,et al.  Modeling Spatial-Temporal Clues in a Hybrid Deep Learning Framework for Video Classification , 2015, ACM Multimedia.

[48]  Kilian Q. Weinberger,et al.  Densely Connected Convolutional Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[49]  Andrew Zisserman,et al.  Convolutional Two-Stream Network Fusion for Video Action Recognition , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[50]  Vladlen Koltun,et al.  Multi-Task Learning as Multi-Objective Optimization , 2018, NeurIPS.

[51]  Qi Tian,et al.  Enhancing Micro-video Understanding by Harnessing External Sounds , 2017, ACM Multimedia.

[52]  Jason J. Corso,et al.  Action bank: A high-level representation of activity in video , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[53]  Kaiming He,et al.  Group Normalization , 2018, ECCV.

[54]  Jian Sun,et al.  Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[55]  Antonio Torralba,et al.  Generating the Future with Adversarial Transformers , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[56]  Yuke Li,et al.  Video Forecasting with Forward-Backward-Net: Delving Deeper into Spatiotemporal Consistency , 2018, ACM Multimedia.

[57]  Lukasz Kaiser,et al.  Attention is All you Need , 2017, NIPS.

[58]  Iasonas Kokkinos,et al.  UberNet: Training a Universal Convolutional Neural Network for Low-, Mid-, and High-Level Vision Using Diverse Datasets and Limited Memory , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[59]  Vladlen Koltun,et al.  An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling , 2018, ArXiv.

[60]  Cordelia Schmid,et al.  Action Recognition with Improved Trajectories , 2013, 2013 IEEE International Conference on Computer Vision.

[61]  Shih-Fu Chang,et al.  ConvNet Architecture Search for Spatiotemporal Feature Learning , 2017, ArXiv.

[62]  Leslie N. Smith,et al.  Cyclical Learning Rates for Training Neural Networks , 2015, 2017 IEEE Winter Conference on Applications of Computer Vision (WACV).

[63]  Christian Wolf,et al.  Sequential Deep Learning for Human Action Recognition , 2011, HBU.

[64]  Juan Carlos Niebles,et al.  What Makes a Video a Video: Analyzing Temporal Information in Video Understanding Models and Datasets , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[65]  Cordelia Schmid,et al.  A Spatio-Temporal Descriptor Based on 3D-Gradients , 2008, BMVC.

[66]  Tao Mei,et al.  Learning Spatio-Temporal Representation with Pseudo-3D Residual Networks , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[67]  Daniel Rueckert,et al.  Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[68]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[69]  Bolei Zhou,et al.  Temporal Relational Reasoning in Videos , 2017, ECCV.