Anurag Arnab | Mostafa Dehghani | Georg Heigold | Chen Sun | Mario Lucic | Cordelia Schmid