What Can Simple Arithmetic Operations Do for Temporal Modeling?
Wanli Ouyang | Wenhao Wu | Jingdong Wang | Chang Xu | Zhun Sun | Yuxin Song
[1] Ledell Yu Wu,et al. EVA-CLIP: Improved Training Techniques for CLIP at Scale , 2023, ArXiv.
[2] Wanli Ouyang,et al. UATVR: Uncertainty-Adaptive Text-Video Retrieval , 2023, 2023 IEEE/CVF International Conference on Computer Vision (ICCV).
[3] Haipeng Luo,et al. Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models , 2022, 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[4] Haipeng Luo,et al. Cap4Video: What Can Auxiliary Captions Do for Text-Video Retrieval? , 2022, 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[5] Kunchang Li,et al. UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer , 2022, ArXiv.
[6] Ludwig Schmidt,et al. LAION-5B: An open large-scale dataset for training next generation image-text models , 2022, NeurIPS.
[7] Gerard de Melo,et al. Frozen CLIP Models are Efficient Video Learners , 2022, ECCV.
[8] Haibin Ling,et al. Expanding Language-Image Pretrained Models for General Video Recognition , 2022, ECCV.
[9] Jungong Han,et al. Temporal Saliency Query Network for Efficient Video Recognition , 2022, ECCV.
[10] Wanli Ouyang,et al. NSNet: Non-saliency Suppression Sampler for Efficient Video Recognition , 2022, ECCV.
[11] Wanli Ouyang,et al. Revisiting Classifier: Transferring Vision-Language Models for Video Recognition , 2022, AAAI.
[12] Hongsheng Li,et al. ST-Adapter: Parameter-Efficient Image-to-Video Transfer Learning for Action Recognition , 2022, NeurIPS.
[13] Haichao Shi,et al. TwinNet: Twin Structured Knowledge Transfer Network for Weakly Supervised Action Localization , 2022, Machine Intelligence Research.
[14] Yi Yang,et al. CenterCLIP: Token Clustering for Efficient Text-Video Retrieval , 2022, Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.
[15] C. Schmid,et al. Multiview Transformers for Video Recognition , 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[16] Andrew M. Dai,et al. Co-training Transformer with Videos and Images Improves Action Recognition , 2021, ArXiv.
[17] Weidi Xie,et al. Prompting Visual-Language Models for Efficient Video Understanding , 2021, ECCV.
[18] Lu Yuan,et al. Florence: A New Foundation Model for Computer Vision , 2021, ArXiv.
[19] Mengmeng Wang,et al. ActionCLIP: A New Paradigm for Video Action Recognition , 2021, ArXiv.
[20] Stephen Lin,et al. Video Swin Transformer , 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[21] A. Piergiovanni,et al. TokenLearner: What Can 8 Learned Tokens Do for Images and Videos? , 2021, ArXiv.
[22] Alexander Kolesnikov,et al. Scaling Vision Transformers , 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[23] Wenhao Wu,et al. DSANet: Dynamic Segment Aggregation Network for Video-Level Representation Learning , 2021, ACM Multimedia.
[24] Christoph Feichtenhofer,et al. Multiscale Vision Transformers , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).
[25] Lihi Zelnik-Manor,et al. ImageNet-21K Pretraining for the Masses , 2021, NeurIPS Datasets and Benchmarks.
[26] Nan Duan,et al. CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval , 2021, Neurocomputing.
[27] Cordelia Schmid,et al. ViViT: A Video Vision Transformer , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).
[28] Enhua Wu,et al. Transformer in Transformer , 2021, NeurIPS.
[29] Ilya Sutskever,et al. Learning Transferable Visual Models From Natural Language Supervision , 2021, ICML.
[30] Suha Kwak,et al. Learning Self-Similarity in Space and Time as Generalized Motion for Video Action Recognition , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).
[31] Heng Wang,et al. Is Space-Time Attention All You Need for Video Understanding? , 2021, ICML.
[32] Omri Bar,et al. Video Transformer Network , 2021, 2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW).
[33] Limin Wang,et al. TDN: Temporal Difference Networks for Efficient Action Recognition , 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[34] Chuang Gan,et al. MVFNet: Multi-View Fusion Network for Efficient Video Recognition , 2020, AAAI.
[35] S. Gelly,et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale , 2020, ICLR.
[36] Suha Kwak,et al. MotionSqueeze: Neural Motion Feature Learning for Video Understanding , 2020, ECCV.
[37] Yali Wang,et al. SmallBigNet: Integrating Core and Contextual Views for Video Classification , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[38] Tong Lu,et al. TAM: Temporal Adaptive Module for Video Recognition , 2020, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).
[39] Christoph Feichtenhofer,et al. X3D: Expanding Architectures for Efficient Video Recognition , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[40] Bin Kang,et al. TEA: Temporal Excitation and Aggregation for Action Recognition , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[41] Weilin Huang,et al. V4D: 4D Convolutional Neural Networks for Video-level Representation Learning , 2020, ICLR.
[42] Shilei Wen,et al. Dynamic Inference: A New Approach Toward Efficient Video Action Recognition , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).
[43] K. Grauman,et al. Listen to Look: Action Recognition by Previewing Audio , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[44] O. Lanz,et al. Gate-Shift Networks for Video Action Recognition , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[45] Feiyue Huang,et al. TEINet: Towards an Efficient Architecture for Video Recognition , 2019, AAAI.
[46] Alan Yuille,et al. Grouped Spatial-Temporal Aggregation for Efficient Action Recognition , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[47] Wei Wu,et al. STM: SpatioTemporal and Motion Encoding for Action Recognition , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[48] Wenhao Wu,et al. Multi-Agent Reinforcement Learning Based Frame Sampling for Effective Untrimmed Video Recognition , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[49] Heng Wang,et al. Video Modeling With Correlation Networks , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[50] Heng Wang,et al. Video Classification With Channel-Separated Convolutional Networks , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[51] Kaiming He,et al. Long-Term Feature Banks for Detailed Video Understanding , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[52] Jitendra Malik,et al. SlowFast Networks for Video Recognition , 2018, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[53] Larry S. Davis,et al. AdaFrame: Adaptive Frame Selection for Fast Video Recognition , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[54] Chuang Gan,et al. TSM: Temporal Shift Module for Efficient Video Understanding , 2018, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[55] Xiao Liu,et al. StNet: Local and Global Spatial-Temporal Modeling for Action Recognition , 2018, AAAI.
[56] Thomas Brox,et al. ECO: Efficient Convolutional Network for Online Video Understanding , 2018, ECCV.
[57] Chen Sun,et al. Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification , 2017, ECCV.
[58] Yann LeCun,et al. A Closer Look at Spatiotemporal Convolutions for Action Recognition , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[59] Wei Zhang,et al. Optical Flow Guided Feature: A Fast and Robust Motion Representation for Video Action Recognition , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[60] Bolei Zhou,et al. Temporal Relational Reasoning in Videos , 2017, ECCV.
[61] Abhinav Gupta,et al. Non-local Neural Networks , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[62] Tao Mei,et al. Learning Spatio-Temporal Representation with Pseudo-3D Residual Networks , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).
[63] Xiao Liu,et al. Temporal Modeling Approaches for Large-scale Youtube-8M Video Understanding , 2017, ArXiv.
[64] Chen Sun,et al. Revisiting Unreasonable Effectiveness of Data in Deep Learning Era , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).
[65] Susanne Westphal,et al. The “Something Something” Video Database for Learning and Evaluating Visual Common Sense , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).
[66] Lukasz Kaiser,et al. Attention is All you Need , 2017, NIPS.
[67] Andrew Zisserman,et al. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[68] Fabio Viola,et al. The Kinetics Human Action Video Dataset , 2017, ArXiv.
[69] Richard P. Wildes,et al. Spatiotemporal Residual Networks for Video Action Recognition , 2016, NIPS.
[70] Luc Van Gool,et al. Temporal Segment Networks: Towards Good Practices for Deep Action Recognition , 2016, ECCV.
[71] Bowen Zhang,et al. Real-Time Action Recognition with Enhanced Motion Vector CNNs , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[72] Ali Farhadi,et al. Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding , 2016, ECCV.
[73] Jian Sun,et al. Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[74] Bernard Ghanem,et al. ActivityNet: A large-scale video benchmark for human activity understanding , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[75] Matthew J. Hausknecht,et al. Beyond short snippets: Deep networks for video classification , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[76] Sergey Ioffe,et al. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift , 2015, ICML.
[77] Lorenzo Torresani,et al. Learning Spatiotemporal Features with 3D Convolutional Networks , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).
[78] Trevor Darrell,et al. Long-term recurrent convolutional networks for visual recognition and description , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[79] Andrew Zisserman,et al. Two-Stream Convolutional Networks for Action Recognition in Videos , 2014, NIPS.
[80] Fei-Fei Li,et al. ImageNet: A large-scale hierarchical image database , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.
[81] Heydi Mendez Vazquez,et al. A new image division for LBP method to improve face recognition under varying lighting conditions , 2008, 2008 19th International Conference on Pattern Recognition.
[82] B. V. K. Vijaya Kumar,et al. Correlation Pattern Recognition , 2002 .
[83] Hongsheng Li,et al. Parameter-Efficient Image-to-Video Transfer Learning , 2022 .
[84] Stephen Lin,et al. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).
[85] Konstantinos G. Derpanis,et al. Integral image-based representations , 2007 .