Skip-Convolutions for Efficient Video Processing

We propose Skip-Convolutions to leverage the large amount of redundancy in video streams and save computation. Each video is represented as a series of changes across frames and network activations, denoted as residuals. We reformulate standard convolution so that it is efficiently computed on these residual frames: each layer is coupled with a binary gate that decides whether a residual is important to the model's prediction, e.g., foreground regions, or can be safely skipped, e.g., background regions. These gates can either be implemented as an efficient network trained jointly with the convolution kernels, or they can simply skip residuals based on their magnitude. Gating functions can also incorporate block-wise sparsity structures, as required for efficient implementation on hardware platforms. By replacing all convolutions with Skip-Convolutions in two state-of-the-art architectures, namely EfficientDet and HRNet, we reduce their computational cost consistently by a factor of 3∼4× on two different tasks, without any drop in accuracy. Extensive comparisons with existing model-compression methods, as well as image and video efficiency methods, demonstrate that Skip-Convolutions set a new state of the art by effectively exploiting the temporal redundancy in videos.
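
For concreteness, the following is a minimal PyTorch-style sketch of a skip-convolution layer with a magnitude-based ("norm") gate. The class name SkipConv2d, the threshold parameter, and the per-pixel gating are illustrative assumptions, not the authors' implementation; the sketch applies a dense convolution to the masked residual purely to show the arithmetic, whereas an efficient implementation would exploit the (block-wise) sparsity of the gated residual, and the learned-gate variant is omitted.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SkipConv2d(nn.Conv2d):
    """Sketch of a convolution operating on inter-frame residuals.

    On the first frame it behaves like a standard convolution. On later
    frames it convolves only the gated residual x_t - x_{t-1} and adds the
    result to the cached previous output, using the linearity of
    convolution: conv(x_t) = conv(x_{t-1}) + conv(x_t - x_{t-1}).
    """

    def __init__(self, *args, threshold=0.1, **kwargs):
        super().__init__(*args, **kwargs)
        self.threshold = threshold  # norm gate: skip residuals below this magnitude
        self.prev_input = None      # cached x_{t-1}
        self.prev_output = None     # cached z_{t-1} = conv(x_{t-1})

    def forward(self, x):
        if self.prev_input is None:
            z = super().forward(x)  # first frame: dense convolution
        else:
            residual = x - self.prev_input
            # Binary gate per spatial position: 1 where the residual is large
            # (e.g. moving foreground), 0 where it can be safely skipped.
            gate = (residual.abs().amax(dim=1, keepdim=True) > self.threshold).float()
            # Convolve only the surviving residual; the bias is already
            # contained in the cached output, so it is not added again.
            z = self.prev_output + F.conv2d(
                residual * gate, self.weight, bias=None,
                stride=self.stride, padding=self.padding,
                dilation=self.dilation, groups=self.groups)
        # Cache (detached) states for the next frame; this sketch targets inference.
        self.prev_input, self.prev_output = x.detach(), z.detach()
        return z

Replacing the convolutions of a backbone such as EfficientDet or HRNet with a layer of this kind and streaming frames through it approximates the per-frame dense outputs, with an error that depends on the skipped residuals. The actual savings come from not computing the convolution at gated-off positions, which is why the gates are paired with block-wise sparsity structures and, in the learned variant, trained jointly with the kernels.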

[1] Menglong Zhu et al. Looking Fast and Slow: Memory-Guided Mobile Video Object Detection, 2019, arXiv.

[4] Geoffrey E. Hinton et al. Distilling the Knowledge in a Neural Network, 2015, arXiv.

[5] Yee Whye Teh et al. The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables, 2016, ICLR.

[6] Yoshua Bengio et al. Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation, 2013, arXiv.

[7] Dahua Lin et al. Low-Latency Video Semantic Segmentation, 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[8] Yi Yang et al. Articulated Human Detection with Flexible Mixtures of Parts, 2013, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[9] Xin Wang et al. Accel: A Corrective Fusion Network for Efficient Semantic Segmentation on Video, 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[10] Tinne Tuytelaars et al. Dynamic Convolutions: Exploiting Spatial Sparsity for Faster Inference, 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[11] Andrew Zisserman et al. Speeding up Convolutional Neural Networks with Low Rank Expansions, 2014, BMVC.

[12] Gary J. Sullivan et al. Overview of the High Efficiency Video Coding (HEVC) Standard, 2012, IEEE Transactions on Circuits and Systems for Video Technology.

[13] Xiangyu Zhang et al. Channel Pruning for Accelerating Very Deep Neural Networks, 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[14] Markus Nagel et al. Taxonomy and Evaluation of Structured Compression of Convolutional Neural Networks, 2019, arXiv.

[15] Max Welling et al. Group Equivariant Convolutional Networks, 2016, ICML.

[16] Juergen Gall et al. Pose for Action - Action for Pose, 2016, 2017 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017).

[17] Luc Van Gool et al. Thin-Slicing Network: A Deep Structured Model for Pose Estimation in Videos, 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[18] Deva Ramanan et al. N-best maximal decoders for part models, 2011, 2011 International Conference on Computer Vision.

[20] Kilian Q. Weinberger et al. Multi-Scale Dense Networks for Resource Efficient Image Classification, 2017, ICLR.

[21] Serge J. Belongie et al. Convolutional Networks with Adaptive Inference Graphs, 2017, International Journal of Computer Vision.

[22] Larry S. Davis et al. A Coarse-to-Fine Framework for Resource Efficient Video Recognition, 2019, International Journal of Computer Vision.

[23] Xiaoxiao Li et al. Not All Pixels Are Equal: Difficulty-Aware Semantic Segmentation via Deep Layer Cascade, 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[24] Ben Poole et al. Categorical Reparameterization with Gumbel-Softmax, 2016, ICLR.

[25] Christoph Feichtenhofer et al. X3D: Expanding Architectures for Efficient Video Recognition, 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[26] Yang Zhao et al. Deep High-Resolution Representation Learning for Visual Recognition, 2019, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[27] Li Zhang et al. Spatially Adaptive Computation Time for Residual Networks, 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[28] Ming-Hsuan Yang et al. UA-DETRAC: A new benchmark and protocol for multi-object detection and tracking, 2015, Computer Vision and Image Understanding.

[29] Jordi Torres et al. Skip RNN: Learning to Skip State Updates in Recurrent Neural Networks, 2017, ICLR.

[30] Jiashi Feng et al. Dynamic Kernel Distillation for Efficient Pose Estimation in Videos, 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[31] A. Piergiovanni et al. Tiny Video Networks, 2019, Applied AI Letters.

[32] Chuang Gan et al. TSM: Temporal Shift Module for Efficient Video Understanding, 2018, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[33] Bin Yang et al. SBNet: Sparse Blocks Network for Fast Inference, 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[34] W. H. Young. On the Multiplication of Successions of Fourier Constants, 1912.

[35] Daniel Matolin et al. High-DR frame-free PWM imaging with asynchronous AER intensity encoding and focal-plane temporal redundancy suppression, 2010, Proceedings of 2010 IEEE International Symposium on Circuits and Systems.

[36] Maurice Weiler et al. A General Theory of Equivariant CNNs on Homogeneous Spaces, 2018, NeurIPS.

[37] Jian Sun et al. Accelerating Very Deep Convolutional Networks for Classification and Detection, 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[38] Quoc V. Le et al. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks, 2019, ICML.

[39] Guillaume-Alexandre Bilodeau et al. SpotNet: Self-Attention Multi-Task Network for Object Detection, 2020, 2020 17th Conference on Computer and Robot Vision (CRV).

[40] Yann LeCun et al. A Closer Look at Spatiotemporal Convolutions for Action Recognition, 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[41] Sander M. Bohte et al. Fast and Efficient Asynchronous Neural Computation with Adapting Spiking Neural Networks, 2016, arXiv.

[42] Cheng-Zhong Xu et al. Dynamic Channel Pruning: Feature Boosting and Suppression, 2018, ICLR.

[43] Joelle Pineau et al. Conditional Computation in Neural Networks for faster models, 2015, arXiv.

[44] Yichen Wei et al. Deep Feature Flow for Video Recognition, 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[45] Kate Saenko et al. AR-Net: Adaptive Frame Resolution for Efficient Action Recognition, 2020, ECCV.

[46] Jimmy Ba et al. Adam: A Method for Stochastic Optimization, 2014, ICLR.

[47] Max Welling et al. Learning Sparse Neural Networks through L0 Regularization, 2017, ICLR.

[48] Max Welling et al. Temporally Efficient Deep Learning with Spikes, 2018, ICLR.

[49] Quoc V. Le et al. EfficientDet: Scalable and Efficient Object Detection, 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[50] Zhe L. Lin et al. Temporally Distributed Networks for Fast Video Semantic Segmentation, 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[51] Song-Chun Zhu et al. Joint action recognition and pose estimation from video, 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[52] Cordelia Schmid et al. Towards Understanding Action Recognition, 2013, 2013 IEEE International Conference on Computer Vision.

[53] Wulfram Gerstner et al. Spiking Neuron Models: Single Neurons, Populations, Plasticity, 2002.

[54] Yoshua Bengio et al. FitNets: Hints for Thin Deep Nets, 2014, ICLR.

[55] Yi Yang et al. More is Less: A More Complicated Network with Less Inference Complexity, 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[56] Luc Van Gool et al. Temporal Segment Networks: Towards Good Practices for Deep Action Recognition, 2016, ECCV.

[57] Jianbo Liu et al. LSTM Pose Machines, 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[58] Risi Kondor et al. On the Generalization of Equivariance and Convolution in Neural Networks to the Action of Compact Groups, 2018, ICML.

[59] Yichen Wei et al. Towards High Performance Video Object Detection, 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[60] Max Welling et al. Batch-shaping for learning conditional channel gated networks, 2019, ICLR.

[61] Pietro Perona et al. Microsoft COCO: Common Objects in Context, 2014, ECCV.

[62] Edward H. Adelson et al. The Design and Use of Steerable Filters, 1991, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[63] Yulan Guo et al. Learning Sparse Masks for Efficient Image Super-Resolution, 2020, arXiv.

[64] Trevor Darrell et al. Clockwork Convnets for Video Semantic Segmentation, 2016, ECCV Workshops.