FrameExit: Conditional Early Exiting for Efficient Video Recognition

In this paper, we propose a conditional early exiting framework for efficient video recognition. While existing works focus on selecting a subset of salient frames to re-duce the computation costs, we propose to use a simple sampling strategy combined with conditional early exiting to enable efficient recognition. Our model automatically learns to process fewer frames for simpler videos and more frames for complex ones. To achieve this, we employ a cascade of gating modules to automatically determine the earliest point in processing where an inference is sufficiently reliable. We generate on-the-fly supervision signals to the gates to provide a dynamic trade-off between accuracy and computational cost. Our proposed model outperforms competing methods on three large-scale video benchmarks. In particular, on ActivityNet1.3 and mini-kinetics, we outperform the state-of-the-art efficient video recognition methods with 1.3× and 2.1 less GFLOPs, respectively. Addition-ally, our method sets× a new state of the art for efficient video understanding on the HVU benchmark.

[1]  H. T. Kung,et al.  BranchyNet: Fast inference via early exiting from deep neural networks , 2016, 2016 23rd International Conference on Pattern Recognition (ICPR).

[2]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[3]  Bernard Ghanem,et al.  ActivityNet: A large-scale video benchmark for human activity understanding , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[4]  Yi Yang,et al.  Watching a Small Portion could be as Good as Watching All: Towards Efficient Video Classification , 2018, IJCAI.

[5]  A. Oliva,et al.  AdaFuse: Adaptive Temporal Fusion Network for Efficient Action Recognition , 2021, ICLR.

[6]  Max Welling,et al.  Batch-shaping for learning conditional channel gated networks , 2019, ICLR.

[7]  Yang Li,et al.  GaterNet: Dynamic Filter Selection in Convolutional Neural Network via a Dedicated Global Gating Network , 2018, ArXiv.

[8]  Lukasz Kaiser,et al.  Attention is All you Need , 2017, NIPS.

[9]  Larry S. Davis,et al.  A Coarse-to-Fine Framework for Resource Efficient Video Recognition , 2019, International Journal of Computer Vision.

[10]  Luc Van Gool,et al.  Large Scale Holistic Video Understanding , 2019, ECCV.

[11]  Yutaka Satoh,et al.  Learning Spatio-Temporal Features with 3D Residual Networks for Action Recognition , 2017, 2017 IEEE International Conference on Computer Vision Workshops (ICCVW).

[12]  Juan Carlos Niebles,et al.  RubiksNet: Learnable 3D-Shift for Efficient Video Action Recognition , 2020, ECCV.

[13]  Simone Calderara,et al.  Conditional Channel Gated Networks for Task-Aware Continual Learning , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[14]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[15]  Christoph Feichtenhofer,et al.  X3D: Expanding Architectures for Efficient Video Recognition , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[16]  Larry S. Davis,et al.  AdaFrame: Adaptive Frame Selection for Fast Video Recognition , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[17]  Jordi Torres,et al.  Skip RNN: Learning to Skip State Updates in Recurrent Neural Networks , 2017, ICLR.

[18]  Larry S. Davis,et al.  BlockDrop: Dynamic Inference Paths in Residual Networks , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[19]  Quanfu Fan,et al.  More Is Less: Learning Efficient Video Representations by Big-Little Network and Depthwise Temporal Aggregation , 2019, NeurIPS.

[20]  Lorenzo Torresani,et al.  SCSampler: Sampling Salient Clips From Video for Efficient Action Recognition , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[21]  Xin Wang,et al.  SkipNet: Learning Dynamic Routing in Convolutional Networks , 2017, ECCV.

[22]  Shih-Fu Chang,et al.  ConvNet Architecture Search for Spatiotemporal Feature Learning , 2017, ArXiv.

[23]  Tao Mei,et al.  Learning Spatio-Temporal Representation with Pseudo-3D Residual Networks , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[24]  Jitendra Malik,et al.  SlowFast Networks for Video Recognition , 2018, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[25]  Yee Whye Teh,et al.  The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables , 2016, ICLR.

[26]  Quoc V. Le,et al.  EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks , 2019, ICML.

[27]  Tinne Tuytelaars,et al.  Dynamic Convolutions: Exploiting Spatial Sparsity for Faster Inference , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[28]  Heng Wang,et al.  Video Classification With Channel-Separated Convolutional Networks , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[29]  Li Fei-Fei,et al.  End-to-End Learning of Action Detection from Frame Glimpses in Videos , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[30]  Cees Snoek,et al.  Video Time: Properties, Encoders and Evaluation , 2018, BMVC.

[31]  Thomas Brox,et al.  ECO: Efficient Convolutional Network for Online Video Understanding , 2018, ECCV.

[32]  Ahmet Gunduz,et al.  Resource Efficient 3D Convolutional Neural Networks , 2019, 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW).

[33]  Zhuowen Tu,et al.  Deeply-Supervised Nets , 2014, AISTATS.

[34]  Lorenzo Torresani,et al.  Learning Spatiotemporal Features with 3D Convolutional Networks , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).

[35]  Kilian Q. Weinberger,et al.  Multi-Scale Dense Networks for Resource Efficient Image Classification , 2017, ICLR.

[36]  Serge J. Belongie,et al.  Convolutional Networks with Adaptive Inference Graphs , 2017, International Journal of Computer Vision.

[37]  Mihir Jain,et al.  TimeGate: Conditional Gating of Segments in Long-range Activities , 2020, ArXiv.

[38]  Chuang Gan,et al.  TSM: Temporal Shift Module for Efficient Video Understanding , 2018, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[39]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[40]  Wenhao Wu,et al.  Multi-Agent Reinforcement Learning Based Frame Sampling for Effective Untrimmed Video Recognition , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[41]  Kate Saenko,et al.  AR-Net: Adaptive Frame Resolution for Efficient Action Recognition , 2020, ECCV.

[42]  Li Zhang,et al.  Spatially Adaptive Computation Time for Residual Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[43]  Matthew J. Hausknecht,et al.  Beyond short snippets: Deep networks for video classification , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[44]  Tinne Tuytelaars,et al.  Rank Pooling for Action Recognition , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[45]  Bolei Zhou,et al.  Temporal Relational Reasoning in Videos , 2017, ECCV.

[46]  Limin Wang,et al.  Dynamic Sampling Networks for Efficient Action Recognition in Videos , 2020, IEEE Transactions on Image Processing.

[47]  Andrew Zisserman,et al.  Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[48]  Fabio Viola,et al.  The Kinetics Human Action Video Dataset , 2017, ArXiv.

[49]  Cheng-Zhong Xu,et al.  Dynamic Channel Pruning: Feature Boosting and Suppression , 2018, ICLR.

[50]  Jifeng Dai,et al.  Resolution Adaptive Networks for Efficient Inference , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[51]  Yulan Guo,et al.  Learning Sparse Masks for Efficient Image Super-Resolution , 2020, ArXiv.

[52]  Richard S. Sutton,et al.  Reinforcement Learning: An Introduction , 1998, IEEE Trans. Neural Networks.

[53]  Yann LeCun,et al.  A Closer Look at Spatiotemporal Convolutions for Action Recognition , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[54]  Tae-Hyun Oh,et al.  Listen to Look: Action Recognition by Previewing Audio , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[55]  Chen Sun,et al.  Rethinking Spatiotemporal Feature Learning For Video Understanding , 2017, ArXiv.