Spatio-Temporal Gated Transformers for Efficient Video Processing

We focus on the problem of efficient video stream processing with fully transformerbased architectures. Recent advances brought by transformers for image-based tasks inspires the research interests of applying transformers for videos. Yet, when applying image-based transformer solutions to videos, the computation becomes inefficient due to the redundant information in adjacent video frames. An analysis of the computation cost of the video object detection framework DETR identifies the linear layers as the major computation bottleneck. Thus, we propose dynamic gating layers to conduct conditional computation. With the generated binary or ternary gates, it is possible to avoid the computation for the stable background tokens in the video frames. The effectiveness of the dynamic gating mechanism for transformers is validated by experimental results. For video object detection, the FLOPs could be reduced by 48.3% without a significant drop of accuracy.

[1]  L. Gool,et al.  Video Super-Resolution Transformer , 2021, ArXiv.

[2]  Carlos Riquelme,et al.  Scaling Vision with Sparse Mixture of Experts , 2021, NeurIPS.

[3]  Taco S. Cohen,et al.  Skip-Convolutions for Efficient Video Processing , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[4]  Luc Van Gool,et al.  LocalViT: Bringing Locality to Vision Transformers , 2021, ArXiv.

[5]  Jonathon Shlens,et al.  Scaling Local Self-Attention for Parameter Efficient Visual Backbones , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[6]  Xiang Li,et al.  Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[7]  Omri Bar,et al.  Video Transformer Network , 2021, 2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW).

[8]  Pieter Abbeel,et al.  Bottleneck Transformers for Visual Recognition , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[9]  L. Leal-Taixé,et al.  TrackFormer: Multi-Object Tracking with Transformers , 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[10]  Matthieu Cord,et al.  Training data-efficient image transformers & distillation through attention , 2020, ICML.

[11]  B. Ommer,et al.  Taming Transformers for High-Resolution Image Synthesis , 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[12]  Bin Li,et al.  Deformable DETR: Deformable Transformers for End-to-End Object Detection , 2020, ICLR.

[13]  Nikolaos Pappas,et al.  Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention , 2020, ICML.

[14]  Luc Van Gool,et al.  DHP: Differentiable Meta Pruning via HyperNetworks , 2020, ECCV.

[15]  Yue Cao,et al.  Memory Enhanced Global-Local Aggregation for Video Object Detection , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[16]  Luc Van Gool,et al.  Group Sparsity: The Hinge Between Filter Pruning and Decomposition for Network Compression , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[17]  A. Yuille,et al.  Axial-DeepLab: Stand-Alone Axial-Attention for Panoptic Segmentation , 2020, ECCV.

[18]  Lukasz Kaiser,et al.  Reformer: The Efficient Transformer , 2020, ICLR.

[19]  Tinne Tuytelaars,et al.  Dynamic Convolutions: Exploiting Spatial Sparsity for Faster Inference , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[20]  Luc Van Gool,et al.  Learning Filter Basis for Convolutional Neural Network Compression , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[21]  Ilya Sutskever,et al.  Generating Long Sequences with Sparse Transformers , 2019, ArXiv.

[22]  Xiangyu Zhang,et al.  MetaPruning: Meta Learning for Automatic Neural Network Channel Pruning , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[23]  Andrew Zisserman,et al.  Video Action Transformer Network , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[24]  Trevor Darrell,et al.  Rethinking the Value of Network Pruning , 2018, ICLR.

[25]  Xiangyu Zhang,et al.  Channel Pruning for Accelerating Very Deep Neural Networks , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[26]  Lukasz Kaiser,et al.  Attention is All you Need , 2017, NIPS.

[27]  Ben Poole,et al.  Categorical Reparameterization with Gumbel-Softmax , 2016, ICLR.

[28]  Yee Whye Teh,et al.  The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables , 2016, ICLR.

[29]  Ran El-Yaniv,et al.  Binarized Neural Networks , 2016, NIPS.

[30]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[31]  Ming-Hsuan Yang,et al.  UA-DETRAC: A new benchmark and protocol for multi-object detection and tracking , 2015, Comput. Vis. Image Underst..

[32]  Yoshua Bengio,et al.  BinaryConnect: Training Deep Neural Networks with binary weights during propagations , 2015, NIPS.

[33]  Song Han,et al.  Deep Compression: Compressing Deep Neural Network with Pruning, Trained Quantization and Huffman Coding , 2015, ICLR.

[34]  Jian Sun,et al.  Accelerating Very Deep Convolutional Networks for Classification and Detection , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[35]  Pietro Perona,et al.  Microsoft COCO: Common Objects in Context , 2014, ECCV.

[36]  P. Cochat,et al.  Et al , 2008, Archives de pediatrie : organe officiel de la Societe francaise de pediatrie.

[37]  Stephen Lin,et al.  Swin Transformer: Hierarchical Vision Transformer using Shifted Windows , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[38]  Wankou Yang,et al.  TransPose: Towards Explainable Human Pose Estimation by Transformer , 2020, ArXiv.

[39]  Ming-Wei Chang,et al.  BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding , 2019, NAACL.