A Multigrid Method for Efficiently Training Video Models

Training competitive deep video models is an order of magnitude slower than training their counterpart image models. Slow training causes long research cycles, which hinders progress in video understanding research. Following standard practice for training image models, video model training has used a fixed mini-batch shape: a specific number of clips, a specific number of frames, and a specific spatial size. However, what is the optimal shape? High-resolution models perform well, but train slowly. Low-resolution models train faster, but are less accurate. Inspired by multigrid methods in numerical optimization, we propose to use variable mini-batch shapes with different spatial-temporal resolutions that are varied according to a schedule. The different shapes arise from resampling the training data on multiple sampling grids. Training is accelerated by scaling up the mini-batch size and learning rate when shrinking the other dimensions. We empirically demonstrate a general and robust grid schedule that yields a significant out-of-the-box training speedup without a loss in accuracy for different models (I3D, non-local, SlowFast), datasets (Kinetics, Something-Something, Charades), and training settings (with and without pre-training, 128 GPUs or 1 GPU). As an illustrative example, the proposed multigrid method trains a ResNet-50 SlowFast network 4.5x faster (wall-clock time, same hardware) while also improving accuracy (+0.8% absolute) on Kinetics-400 compared to baseline training. Code is available online.
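The trade-off between mini-batch size and the other dimensions can be illustrated with a small sketch. It assumes that the number of clips grows in inverse proportion to the frames x height x width volume of each clip, so that the work per iteration stays roughly constant, and that the learning rate is scaled linearly with the number of clips. Both rules, the names `BatchShape` and `rescale`, and the example shapes are illustrative assumptions and are not taken from the paper's actual schedule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BatchShape:
    """Hypothetical container for a mini-batch shape."""
    clips: int    # number of clips in the mini-batch
    frames: int   # frames per clip
    height: int   # spatial height in pixels
    width: int    # spatial width in pixels


def rescale(base: BatchShape, base_lr: float,
            frames: int, height: int, width: int):
    """Return a resized mini-batch shape and a matching learning rate.

    Assumption: clips scale inversely with the per-clip volume
    (frames * height * width), and the learning rate scales
    linearly with the number of clips.
    """
    base_volume = base.frames * base.height * base.width
    new_volume = frames * height * width
    clips = max(1, int(base.clips * base_volume / new_volume))
    lr = base_lr * clips / base.clips
    return BatchShape(clips, frames, height, width), lr


# Illustrative baseline: 8 clips of 16 frames at 224 x 224.
base = BatchShape(clips=8, frames=16, height=224, width=224)

# Halving each of the three dimensions shrinks each clip 8x,
# so 8x as many clips fit in the same per-iteration budget.
coarse_shape, coarse_lr = rescale(base, 0.1, frames=8, height=112, width=112)
print(coarse_shape, coarse_lr)
```

In this sketch the coarse grid processes 64 clips per iteration instead of 8, which is how a pass over the data can take fewer iterations when the schedule visits low-resolution shapes.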
