MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Long-Term Video Recognition
暂无分享,去创建一个
Christoph Feichtenhofer | Jitendra Malik | Karttikeya Mangalam | Chao-Yuan Wu | Bo Xiong | Yanghao Li | Haoqi Fan
[1] Bolei Zhou,et al. Temporal Relational Reasoning in Videos , 2017, ECCV.
[2] Cordelia Schmid,et al. Dense Trajectories and Motion Boundary Descriptors for Action Recognition , 2013, International Journal of Computer Vision.
[3] Cewu Lu,et al. Asynchronous Interaction Aggregation for Action Detection , 2020, ECCV.
[4] Kaiming He,et al. Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour , 2017, ArXiv.
[5] Zheng Shou,et al. Actor-Context-Actor Relation Network for Spatio-Temporal Action Localization , 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[6] Andrew Zisserman,et al. Video Action Transformer Network , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[7] Matthijs Douze,et al. LeViT: a Vision Transformer in ConvNet’s Clothing for Faster Inference , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).
[8] Zhuowen Tu,et al. Aggregated Residual Transformations for Deep Neural Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[9] Abhinav Gupta,et al. ActionVLAD: Learning Spatio-Temporal Aggregation for Action Classification , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[10] Yong Man Ro,et al. Video Prediction Recalling Long-term Motion Context via Memory Alignment Learning , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[11] Georg Heigold,et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale , 2021, ICLR.
[12] Matthieu Cord,et al. Training data-efficient image transformers & distillation through attention , 2020, ICML.
[13] Matthew J. Hausknecht,et al. Beyond short snippets: Deep networks for video classification , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[14] Wei Wu,et al. STM: SpatioTemporal and Motion Encoding for Action Recognition , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[15] Heng Wang,et al. Interactive Prototype Learning for Egocentric Action Recognition , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).
[16] Frank Hutter,et al. Decoupled Weight Decay Regularization , 2017, ICLR.
[17] Jitendra Malik,et al. Recognizing action at a distance , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.
[18] Lorenzo Torresani,et al. Learning Spatiotemporal Features with 3D Convolutional Networks , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).
[19] Liangzhe Yuan,et al. MoViNets: Mobile Video Networks for Efficient Video Recognition , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[20] Jason Weston,et al. Not All Memories are Created Equal: Learning to Forget by Expiring , 2021, ICML.
[21] Cordelia Schmid,et al. Learning realistic human actions from movies , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.
[22] Jiannan Wu,et al. Watch Only Once: An End-to-End Video Action Detection Framework , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).
[23] Gedas Bertasius,et al. Is Space-Time Attention All You Need for Video Understanding? , 2021, ICML.
[24] Cordelia Schmid,et al. ViViT: A Video Vision Transformer , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).
[25] Li Fei-Fei,et al. ImageNet: A large-scale hierarchical image database , 2009, CVPR.
[26] Dima Damen,et al. Scaling Egocentric Vision: The EPIC-KITCHENS Dataset , 2018, ArXiv.
[27] Christoph Feichtenhofer,et al. Multiscale Vision Transformers , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).
[28] Menglong Zhu,et al. Mobile Video Object Detection with Temporally-Aware Feature Maps , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[29] Chuang Gan,et al. TSM: Temporal Shift Module for Efficient Video Understanding , 2018, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[30] Cees Snoek,et al. VideoLSTM convolves, attends and flows for action recognition , 2016, Comput. Vis. Image Underst..
[31] Abhinav Gupta,et al. Non-local Neural Networks , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[32] Andrew Zisserman,et al. Massively Parallel Video Networks , 2018, ECCV.
[33] Serge J. Belongie,et al. Behavior recognition via sparse spatio-temporal features , 2005, 2005 IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance.
[34] Limin Wang,et al. Action recognition with trajectory-pooled deep-convolutional descriptors , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[35] Lukasz Kaiser,et al. Attention is All you Need , 2017, NIPS.
[36] Ashish Vaswani,et al. Self-Attention with Relative Position Representations , 2018, NAACL.
[37] Gunhee Kim,et al. A Memory Network Approach for Story-Based Temporal Summarization of 360° Videos , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[38] Matthieu Cord,et al. Going deeper with Image Transformers , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).
[39] Cordelia Schmid,et al. Evaluation of Local Spatio-temporal Features for Action Recognition , 2009, BMVC.
[40] Jifeng Dai,et al. 1st Place Solution of LVIS Challenge 2020: A Good Box is not a Guarantee of a Good Mask , 2020, ArXiv.
[41] Nenghai Yu,et al. CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows , 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[42] Andrew Zisserman,et al. A Short Note about Kinetics-600 , 2018, ArXiv.
[43] Trevor Darrell,et al. Long-term recurrent convolutional networks for visual recognition and description , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[44] Christoph Feichtenhofer,et al. Improved Multiscale Vision Transformers for Classification and Detection , 2021, ArXiv.
[45] Yiming Yang,et al. Transformer-XL: Attentive Language Models beyond a Fixed-Length Context , 2019, ACL.
[46] Bill Triggs,et al. Histograms of oriented gradients for human detection , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).
[47] Yujie Wang,et al. Flow-Guided Feature Aggregation for Video Object Detection , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).
[48] Tao Mei,et al. Learning Spatio-Temporal Representation with Pseudo-3D Residual Networks , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).
[49] Junjie Yan,et al. Equalization Loss for Long-Tailed Object Recognition , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[50] Andrea Vedaldi,et al. Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers , 2021, NeurIPS.
[51] Kaiming He,et al. Designing Network Design Spaces , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[52] Chen Sun,et al. Rethinking Spatiotemporal Feature Learning For Video Understanding , 2017, ArXiv.
[53] Yue Cao,et al. Memory Enhanced Global-Local Aggregation for Video Object Detection , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[54] Edouard Grave,et al. Adaptive Attention Span in Transformers , 2019, ACL.
[55] Arnold W. M. Smeulders,et al. Timeception for Complex Action Recognition , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[56] Fadime Sener,et al. Technical Report: Temporal Aggregate Representations , 2021, ArXiv.
[57] Thomas Brox,et al. ECO: Efficient Convolutional Network for Online Video Understanding , 2018, ECCV.
[58] Tao Mei,et al. Recurrent Tubelet Proposal and Recognition Networks for Action Detection , 2018, ECCV.
[59] Fabio Viola,et al. The Kinetics Human Action Video Dataset , 2017, ArXiv.
[60] Giovanni Maria Farinella,et al. Rolling-Unrolling LSTMs for Action Anticipation from First-Person Video , 2020, IEEE Transactions on Pattern Analysis and Machine Intelligence.
[61] Ling Shao,et al. Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions , 2021, ArXiv.
[62] Cordelia Schmid,et al. AVA: A Video Dataset of Spatio-Temporally Localized Atomic Visual Actions , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[63] Philipp Krähenbühl,et al. Towards Long-Form Video Understanding , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[64] Jack W. Rae,et al. Do Transformers Need Deep Long-Range Memory? , 2020, ACL.
[65] Andrew Zisserman,et al. A Short Note on the Kinetics-700 Human Action Dataset , 2019, ArXiv.
[66] Kaiming He,et al. Long-Term Feature Banks for Detailed Video Understanding , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[67] Stephen Lin,et al. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).
[68] Kristen Grauman,et al. Ego-Exo: Transferring Visual Representations from Third-person to First-person Videos , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[69] Yann LeCun,et al. Convolutional Learning of Spatio-temporal Features , 2010, ECCV.
[70] Luc Van Gool,et al. Temporal Segment Networks: Towards Good Practices for Deep Action Recognition , 2016, ECCV.
[71] Jitendra Malik,et al. SlowFast Networks for Video Recognition , 2018, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[72] Cordelia Schmid,et al. A Spatio-Temporal Descriptor Based on 3D-Gradients , 2008, BMVC.
[73] Lin Sun,et al. Lattice Long Short-Term Memory for Human Action Recognition , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).
[74] Shuicheng Yan,et al. Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet , 2021, ArXiv.
[75] Yu Qiao,et al. Action Recognition with Stacked Fisher Vectors , 2014, ECCV.
[76] Chi Zhang,et al. Sparse Temporal Causal Convolution for Efficient Action Modeling , 2019, ACM Multimedia.
[77] Lorenzo Torresani,et al. SCSampler: Sampling Salient Clips From Video for Efficient Action Recognition , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[78] Heng Wang,et al. Video Classification With Channel-Separated Convolutional Networks , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[79] Kaiming He,et al. Feature Pyramid Networks for Object Detection , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[80] Longhui Wei,et al. Visformer: The Vision-friendly Transformer , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).
[81] Alexander J. Smola,et al. Compressed Video Action Recognition , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[82] Cordelia Schmid,et al. Action Recognition with Improved Trajectories , 2013, 2013 IEEE International Conference on Computer Vision.
[83] Giovanni Maria Farinella,et al. Leveraging Uncertainty to Rethink Loss Functions and Evaluation Measures for Egocentric Action Anticipation , 2018, ECCV Workshops.
[84] Dima Damen,et al. The EPIC-KITCHENS Dataset: Collection, Challenges and Baselines , 2020, IEEE Transactions on Pattern Analysis and Machine Intelligence.
[85] Christoph Feichtenhofer,et al. X3D: Expanding Architectures for Efficient Video Recognition , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[86] Rohit Girdhar,et al. Anticipative Video Transformer , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).
[87] Kaiming He,et al. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.
[88] Omri Bar,et al. Video Transformer Network , 2021, 2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW).
[89] Apostol Natsev,et al. YouTube-8M: A Large-Scale Video Classification Benchmark , 2016, ArXiv.