TokenLearner: Adaptive Space-Time Tokenization for Videos
A. Piergiovanni | M. Ryoo | A. Angelova | Anurag Arnab | Mostafa Dehghani | Google Research
[1] Anurag Arnab,et al. SCENIC: A JAX Library for Computer Vision Research and Beyond , 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[2] Stephen Lin,et al. Video Swin Transformer , 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[3] A. Dosovitskiy,et al. MLP-Mixer: An all-MLP Architecture for Vision , 2021, NeurIPS.
[4] Christoph Feichtenhofer,et al. Multiscale Vision Transformers , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).
[5] Cordelia Schmid,et al. ViViT: A Video Vision Transformer , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).
[6] Matthew A. Brown,et al. MoViNets: Mobile Video Networks for Efficient Video Recognition , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[7] Heng Wang,et al. Is Space-Time Attention All You Need for Video Understanding? , 2021, ICML.
[8] Pieter Abbeel,et al. Bottleneck Transformers for Visual Recognition , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[9] S. Gelly,et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale , 2020, ICLR.
[10] Michael S. Ryoo,et al. AssembleNet++: Assembling Modality Representations via Attention Connections , 2020, ECCV.
[11] Michael S. Ryoo,et al. AViD Dataset: Anonymized Videos from Diverse Countries , 2020, NeurIPS.
[12] Andrew Zisserman,et al. Self-Supervised MultiModal Versatile Networks , 2020, NeurIPS.
[13] Nicolas Usunier,et al. End-to-End Object Detection with Transformers , 2020, ECCV.
[14] Vladlen Koltun,et al. Exploring Self-Attention for Image Recognition , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[15] Christoph Feichtenhofer,et al. X3D: Expanding Architectures for Efficient Video Recognition , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[16] Michael S. Ryoo,et al. Evolving Losses for Unsupervised Video Representation Learning , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[17] Juan Carlos Niebles,et al. Action Genome: Actions As Compositions of Spatio-Temporal Scene Graphs , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[18] Martin Jaggi,et al. On the Relationship between Self-Attention and Convolutional Layers , 2019, ICLR.
[19] Jakob Uszkoreit,et al. Scaling Autoregressive Video Models , 2019, ICLR.
[20] Michael S. Ryoo,et al. AssembleNet: Searching for Multi-Stream Neural Connectivity in Video Architectures , 2019, ICLR.
[21] Chen Sun,et al. D3D: Distilled 3D Networks for Video Action Recognition , 2018, 2020 IEEE Winter Conference on Applications of Computer Vision (WACV).
[22] Bolei Zhou,et al. Moments in Time Dataset: One Million Videos for Event Understanding , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.
[23] Enhua Wu,et al. Squeeze-and-Excitation Networks , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.
[24] Ashish Vaswani,et al. Stand-Alone Self-Attention in Vision Models , 2019, NeurIPS.
[25] Luc Van Gool,et al. Holistic Large Scale Video Understanding , 2019, ArXiv.
[26] Kaiming He,et al. Long-Term Feature Banks for Detailed Video Understanding , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[27] Jitendra Malik,et al. SlowFast Networks for Video Recognition , 2018, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[28] Andrew Zisserman,et al. Video Action Transformer Network , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[29] Michael S. Ryoo,et al. Evolving Space-Time Neural Architectures for Videos , 2018, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[30] Abhinav Gupta,et al. Videos as Space-Time Region Graphs , 2018, ECCV.
[31] Chen Sun,et al. Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification , 2017, ECCV.
[32] Yann LeCun,et al. A Closer Look at Spatiotemporal Convolutions for Action Recognition , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[33] Abhinav Gupta,et al. Non-local Neural Networks , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[34] Yutaka Satoh,et al. Learning Spatio-Temporal Features with 3D Residual Networks for Action Recognition , 2017, 2017 IEEE International Conference on Computer Vision Workshops (ICCVW).
[35] Chen Sun,et al. Revisiting Unreasonable Effectiveness of Data in Deep Learning Era , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).
[36] Richard P. Wildes,et al. Spatiotemporal Multiplier Networks for Video Action Recognition , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[37] Lukasz Kaiser,et al. Attention is All you Need , 2017, NIPS.
[38] Andrew Zisserman,et al. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[39] Fabio Viola,et al. The Kinetics Human Action Video Dataset , 2017, ArXiv.
[40] Richard P. Wildes,et al. Spatiotemporal Residual Networks for Video Action Recognition , 2016, NIPS.
[41] Andrew Zisserman,et al. Convolutional Two-Stream Network Fusion for Video Action Recognition , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[42] Ali Farhadi,et al. Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding , 2016, ECCV.
[43] Lorenzo Torresani,et al. C3D: Generic Features for Video Analysis , 2014, ArXiv.
[44] Andrew Zisserman,et al. Two-Stream Convolutional Networks for Action Recognition in Videos , 2014, NIPS.
[45] Ming Yang,et al. 3D Convolutional Neural Networks for Human Action Recognition , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.