VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training
[1] Sheng Guo,et al. Cross-Architecture Self-supervised Video Representation Learning , 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[2] Limin Wang,et al. MixFormer: End-to-End Tracking with Iterative Mixed Attention , 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[3] Alan Yuille,et al. Masked Feature Prediction for Self-Supervised Visual Pre-Training , 2021, ArXiv.
[4] Yu-Gang Jiang,et al. BEVT: BERT Pretraining of Video Transformers , 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[5] Han Hu,et al. SimMIM: a Simple Framework for Masked Image Modeling , 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[6] Ross B. Girshick,et al. Masked Autoencoders Are Scalable Vision Learners , 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[7] Stephen Lin,et al. Video Swin Transformer , 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[8] Li Dong,et al. BEiT: BERT Pre-Training of Image Transformers , 2021, ICLR.
[9] Lu Yuan,et al. PeCo: Perceptual Codebook for BERT Pre-training of Vision Transformers , 2021, ArXiv.
[10] Tao Kong,et al. iBOT: Image BERT Pre-Training with Online Tokenizer , 2021, ArXiv.
[11] Ping Luo,et al. Revitalizing CNN Attentions via Transformers in Self-Supervised Visual Representation Learning , 2021, ArXiv.
[12] L. Gool,et al. Vi2CLR: Video and Image for Visual Contrastive Learning of Representation , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).
[13] M. Savvides,et al. Contrast and Order Representations for Video Self-supervised Learning , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).
[14] Thomas Wolf,et al. VIMPAC: Video Pre-Training via Masked Token Prediction and Contrastive Learning , 2021, ArXiv.
[15] Andrea Vedaldi,et al. Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers , 2021, NeurIPS.
[16] Anima Anandkumar,et al. SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers , 2021, NeurIPS.
[17] Kaiming He,et al. A Large-Scale Study on Unsupervised Spatiotemporal Representation Learning , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[18] Julien Mairal,et al. Emerging Properties in Self-Supervised Vision Transformers , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).
[19] Christoph Feichtenhofer,et al. Multiscale Vision Transformers , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).
[20] Pieter Abbeel,et al. VideoGPT: Video Generation using VQ-VAE and Transformers , 2021, ArXiv.
[21] Saining Xie,et al. An Empirical Study of Training Self-Supervised Vision Transformers , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).
[22] Cordelia Schmid,et al. ViViT: A Video Vision Transformer , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).
[23] Huchuan Lu,et al. Transformer Tracking , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[24] Xiaojie Jin,et al. DeepViT: Towards Deeper Vision Transformer , 2021, ArXiv.
[25] Matthew A. Brown,et al. MoViNets: Mobile Video Networks for Efficient Video Recognition , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[26] Wei Liu,et al. VideoMoCo: Contrastive Video Representation Learning with Temporally Adversarial Examples , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[27] Alec Radford,et al. Zero-Shot Text-to-Image Generation , 2021, ICML.
[28] Heng Wang,et al. Is Space-Time Attention All You Need for Video Understanding? , 2021, ICML.
[29] Kate Saenko,et al. Semi-Supervised Action Recognition with Temporal Contrastive Learning , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[30] Matthieu Cord,et al. Training data-efficient image transformers & distillation through attention , 2020, ICML.
[31] Limin Wang,et al. TDN: Temporal Difference Networks for Efficient Action Recognition , 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[32] Runhao Zeng,et al. RSPNet: Relative Speed Perception for Unsupervised Video Representation Learning , 2020, AAAI.
[33] S. Gelly,et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale , 2020, ICLR.
[34] Serge J. Belongie,et al. Spatiotemporal Contrastive Video Representation Learning , 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[35] Zheng Shou,et al. Actor-Context-Actor Relation Network for Spatio-Temporal Action Localization , 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[36] Tong Lu,et al. TAM: Temporal Adaptive Module for Video Recognition , 2020, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).
[37] Stephen Lin,et al. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).
[38] Andrew Zisserman,et al. Self-supervised Co-training for Video Representation Learning , 2020, NeurIPS.
[39] Jianbo Jiao,et al. Self-supervised Video Representation Learning by Pace Prediction , 2020, ECCV.
[40] Andrew Zisserman,et al. Memory-augmented Dense Predictive Coding for Video Representation Learning , 2020, ECCV.
[41] David A. Ross,et al. Learning Video Representations from Textual Web Supervision , 2020, ArXiv.
[42] Mark Chen,et al. Generative Pretraining From Pixels , 2020, ICML.
[43] Andrew Zisserman,et al. Self-Supervised MultiModal Versatile Networks , 2020, NeurIPS.
[44] Bolei Zhou,et al. Video Representation Learning with Visual Tempo Consistency , 2020, ArXiv.
[45] Torsten Hoefler,et al. Augment Your Batch: Improving Generalization Through Instance Repetition , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[46] Mark Chen,et al. Language Models are Few-Shot Learners , 2020, NeurIPS.
[47] Nicolas Usunier,et al. End-to-End Object Detection with Transformers , 2020, ECCV.
[48] Cewu Lu,et al. Asynchronous Interaction Aggregation for Action Detection , 2020, ECCV.
[49] William T. Freeman,et al. SpeedNet: Learning the Speediness in Videos , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[50] Christoph Feichtenhofer,et al. X3D: Expanding Architectures for Efficient Video Recognition , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[51] Geoffrey Zweig,et al. Multi-modal Self-Supervision from Generalized Data Transformations , 2020, ArXiv.
[52] Michael S. Ryoo,et al. Evolving Losses for Unsupervised Video Representation Learning , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[53] Tomohide Shibata. Understand in 5 Minutes!? Skimming Famous Papers: Jacob Devlin et al.: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding , 2020 .
[54] Limin Wang,et al. Learning Spatiotemporal Features via Video and Text Pair Discrimination , 2020, ArXiv.
[55] Andrew Zisserman,et al. End-to-End Learning of Visual Representations From Uncurated Instructional Videos , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[56] Bernard Ghanem,et al. Self-Supervised Learning by Cross-Modal Audio-Video Clustering , 2019, NeurIPS.
[57] Feiyue Huang,et al. TEINet: Towards an Efficient Architecture for Video Recognition , 2019, AAAI.
[58] Quoc V. Le,et al. Randaugment: Practical automated data augmentation with a reduced search space , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).
[59] Heng Wang,et al. Video Modeling With Correlation Networks , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[60] Natalia Gimelshein,et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library , 2019, NeurIPS.
[61] Yueting Zhuang,et al. Self-Supervised Spatiotemporal Learning via Video Clip Order Prediction , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[62] Chong-Wah Ngo,et al. Learning Spatio-Temporal Representation With Local and Global Diffusion , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[63] Seong Joon Oh,et al. CutMix: Regularization Strategy to Train Strong Classifiers With Localizable Features , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[64] Heng Wang,et al. Large-Scale Weakly-Supervised Pre-Training for Video Action Recognition , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[65] Heng Wang,et al. Video Classification With Channel-Separated Convolutional Networks , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[66] Jitendra Malik,et al. SlowFast Networks for Video Recognition , 2018, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[67] Chuang Gan,et al. TSM: Temporal Shift Module for Efficient Video Understanding , 2018, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[68] Limin Wang,et al. Temporal Segment Networks for Action Recognition in Videos , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.
[69] Ilya Sutskever,et al. Language Models are Unsupervised Multitask Learners , 2019 .
[70] Ming-Wei Chang,et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding , 2019, NAACL.
[71] Yann LeCun,et al. A Closer Look at Spatiotemporal Convolutions for Action Recognition , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[72] Abhinav Gupta,et al. Non-local Neural Networks , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[73] Hongyi Zhang,et al. mixup: Beyond Empirical Risk Minimization , 2017, ICLR.
[74] Cordelia Schmid,et al. AVA: A Video Dataset of Spatio-Temporally Localized Atomic Visual Actions , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[75] Alec Radford,et al. Improving Language Understanding by Generative Pre-Training , 2018 .
[76] Ming-Hsuan Yang,et al. Unsupervised Representation Learning by Sorting Sequences , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).
[77] Susanne Westphal,et al. The “Something Something” Video Database for Learning and Evaluating Visual Common Sense , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).
[78] Lukasz Kaiser,et al. Attention is All you Need , 2017, NIPS.
[79] Andrew Zisserman,et al. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[80] Fabio Viola,et al. The Kinetics Human Action Video Dataset , 2017, ArXiv.
[81] Frank Hutter,et al. SGDR: Stochastic Gradient Descent with Warm Restarts , 2016, ICLR.
[82] Alexei A. Efros,et al. Context Encoders: Feature Learning by Inpainting , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[83] Martial Hebert,et al. Shuffle and Learn: Unsupervised Learning Using Temporal Order Verification , 2016, ECCV.
[84] Sergey Ioffe,et al. Rethinking the Inception Architecture for Computer Vision , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[85] Viorica Patraucean,et al. Spatio-temporal video autoencoder with differentiable memory , 2015, ArXiv.
[86] Nitish Srivastava,et al. Unsupervised Learning of Video Representations using LSTMs , 2015, ICML.
[87] Lorenzo Torresani,et al. Learning Spatiotemporal Features with 3D Convolutional Networks , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).
[88] Michael S. Bernstein,et al. ImageNet Large Scale Visual Recognition Challenge , 2014, International Journal of Computer Vision.
[89] Abhinav Gupta,et al. Unsupervised Learning of Visual Representations Using Videos , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).
[90] Andrew Zisserman,et al. Two-Stream Convolutional Networks for Action Recognition in Videos , 2014, NIPS.
[91] Mubarak Shah,et al. UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild , 2012, ArXiv.
[92] Dacheng Tao,et al. Slow Feature Analysis for Human Action Recognition , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.
[93] Thomas Serre,et al. HMDB: A large video database for human motion recognition , 2011, 2011 International Conference on Computer Vision.
[94] Pascal Vincent,et al. Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion , 2010, J. Mach. Learn. Res..
[95] Yoshua Bengio,et al. Extracting and composing robust features with denoising autoencoders , 2008, ICML '08.