The effectiveness of MAE pre-pretraining for billion-scale pretraining
Aaron B. Adcock | Ross B. Girshick | Kalyan Vasudev Alwala | Haoqi Fan | Piotr Dollár | Armand Joulin | Ishan Misra | Christoph Feichtenhofer | Rohit Girdhar | Mannat Singh | Quentin Duval | Vaibhav Aggarwal
[1] Michael G. Rabbat, et al. Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture, 2023, 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[2] Badr Youbi Idrissi, et al. ImageNet-X: Understanding Model Mistakes with Factor of Variation Annotations, 2022, ICLR.
[3] Ludwig Schmidt, et al. LAION-5B: An open large-scale dataset for training next generation image-text models, 2022, NeurIPS.
[4] Kalyan Vasudev Alwala, et al. OmniMAE: Single Model Masked Pretraining on Images and Videos, 2022, arXiv.
[5] Jiangliu Wang, et al. AdaptFormer: Adapting Vision Transformers for Scalable Visual Recognition, 2022, NeurIPS.
[6] Haoqi Fan, et al. Masked Autoencoders As Spatiotemporal Learners, 2022, NeurIPS.
[7] Zirui Wang, et al. CoCa: Contrastive Captioners are Image-Text Foundation Models, 2022, Trans. Mach. Learn. Res.
[8] Michael G. Rabbat, et al. Masked Siamese Networks for Label-Efficient Learning, 2022, ECCV.
[9] Ross B. Girshick, et al. Exploring Plain Vision Transformer Backbones for Object Detection, 2022, ECCV.
[10] Serge J. Belongie, et al. Visual Prompt Tuning, 2022, ECCV.
[11] Limin Wang, et al. VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training, 2022, NeurIPS.
[12] Jianfeng Gao, et al. Focal Modulation Networks, 2022, NeurIPS.
[13] H. Shum, et al. DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection, 2022, ICLR.
[14] Aaron B. Adcock, et al. Revisiting Weakly Supervised Pre-Training of Visual Perception Models, 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[15] Saining Xie, et al. SLIP: Self-supervision meets Language-Image Pre-training, 2021, ECCV.
[16] A. Yuille, et al. Masked Feature Prediction for Self-Supervised Visual Pre-Training, 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[17] Marcus Rohrbach, et al. FLAVA: A Foundational Language And Vision Alignment Model, 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[18] Li Dong, et al. Swin Transformer V2: Scaling Up Capacity and Resolution, 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[19] Han Hu, et al. SimMIM: a Simple Framework for Masked Image Modeling, 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[20] Daniel Keysers, et al. LiT: Zero-Shot Transfer with Locked-image text Tuning, 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[21] Ross B. Girshick, et al. Masked Autoencoders Are Scalable Vision Learners, 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[22] Behnam Neyshabur, et al. Exploring the Limits of Large Scale Pre-training, 2021, ICLR.
[23] Haibin Ling, et al. CBNet: A Composite Backbone Network Architecture for Object Detection, 2021, IEEE Transactions on Image Processing.
[24] Yelong Shen, et al. LoRA: Low-Rank Adaptation of Large Language Models, 2021, ICLR.
[25] Li Dong, et al. BEiT: BERT Pre-Training of Image Transformers, 2021, ICLR.
[26] Alexander Kolesnikov, et al. Scaling Vision Transformers, 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[27] Lu Yuan, et al. Florence: A New Foundation Model for Computer Vision, 2021, arXiv.
[28] Tao Kong, et al. iBOT: Image BERT Pre-Training with Online Tokenizer, 2021, arXiv.
[29] Jun Liu, et al. LVIS Challenge Track Technical Report 1st Place Solution: Distribution Balanced and Boundary Refinement for Large Vocabulary Instance Segmentation, 2021, arXiv.
[30] Jakob Uszkoreit, et al. How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers, 2021, Trans. Mach. Learn. Res.
[31] Lu Yuan, et al. Dynamic Head: Unifying Object Detection Heads with Attentions, 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[32] Julien Mairal, et al. Emerging Properties in Self-Supervised Vision Transformers, 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).
[33] Christoph Feichtenhofer, et al. Multiscale Vision Transformers, 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).
[34] Lihi Zelnik-Manor, et al. ImageNet-21K Pretraining for the Masses, 2021, NeurIPS Datasets and Benchmarks.
[35] Cordelia Schmid, et al. ViViT: A Video Vision Transformer, 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).
[36] Ilya Sutskever, et al. Learning Transferable Visual Models From Natural Language Supervision, 2021, ICML.
[37] S. Gelly, et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, 2020, ICLR.
[38] B. Recht, et al. Do Image Classifiers Generalize Across Time?, 2019, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).
[39] Judy Hoffman, et al. TIDE: A General Toolbox for Identifying Object Detection Errors, 2020, ECCV.
[40] Benjamin Recht, et al. Measuring Robustness to Natural Distribution Shifts in Image Classification, 2020, NeurIPS.
[41] Julien Mairal, et al. Unsupervised Learning of Visual Features by Contrasting Cluster Assignments, 2020, NeurIPS.
[42] Pierre H. Richemond, et al. Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning, 2020, NeurIPS.
[43] Xiaohua Zhai, et al. Are we done with ImageNet?, 2020, arXiv.
[44] Mark Chen, et al. Language Models are Few-Shot Learners, 2020, NeurIPS.
[45] Jianfeng Gao, et al. Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks, 2020, ECCV.
[46] Quoc V. Le, et al. ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators, 2020, ICLR.
[47] Geoffrey E. Hinton, et al. A Simple Framework for Contrastive Learning of Visual Representations, 2020, ICML.
[48] Ross B. Girshick, et al. Momentum Contrast for Unsupervised Visual Representation Learning, 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[49] Myle Ott, et al. Unsupervised Cross-lingual Representation Learning at Scale, 2019, ACL.
[50] Quoc V. Le, et al. RandAugment: Practical automated data augmentation with a reduced search space, 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).
[51] Yi Yang, et al. Random Erasing Data Augmentation, 2017, AAAI.
[52] Jian Sun, et al. Objects365: A Large-Scale, High-Quality Dataset for Object Detection, 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[53] Laurens van der Maaten, et al. Does Object Recognition Work for Everyone?, 2019, CVPR Workshops.
[54] Ross B. Girshick, et al. LVIS: A Dataset for Large Vocabulary Instance Segmentation, 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[55] Seong Joon Oh, et al. CutMix: Regularization Strategy to Train Strong Classifiers With Localizable Features, 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[56] Benjamin Recht, et al. Do ImageNet Classifiers Generalize to ImageNet?, 2019, ICML.
[57] Mona Attariyan, et al. Parameter-Efficient Transfer Learning for NLP, 2019, ICML.
[58] Quoc V. Le, et al. Do Better ImageNet Models Transfer Better?, 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[59] Frank Hutter, et al. Decoupled Weight Decay Regularization, 2017, ICLR.
[60] Boris Katz, et al. ObjectNet: A large-scale bias-controlled dataset for pushing the limits of object recognition models, 2019, NeurIPS.
[61] Kaiming He, et al. Exploring the Limits of Weakly Supervised Pretraining, 2018, ECCV.
[62] Hongyi Zhang, et al. mixup: Beyond Empirical Risk Minimization, 2017, ICLR.
[63] Yang Song, et al. The iNaturalist Species Classification and Detection Dataset, 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[64] Susanne Westphal, et al. The "Something Something" Video Database for Learning and Evaluating Visual Common Sense, 2017, 2017 IEEE International Conference on Computer Vision (ICCV).
[65] Lukasz Kaiser, et al. Attention is All you Need, 2017, NIPS.
[66] Andrew Zisserman, et al. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset, 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[67] Fabio Viola, et al. The Kinetics Human Action Video Dataset, 2017, arXiv.
[68] Kilian Q. Weinberger, et al. Deep Networks with Stochastic Depth, 2016, ECCV.
[69] Paolo Favaro, et al. Unsupervised Learning of Visual Representations by Solving Jigsaw Puzzles, 2016, ECCV.
[70] Sergey Ioffe, et al. Rethinking the Inception Architecture for Computer Vision, 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[71] Karol J. Piczak. ESC: Dataset for Environmental Sound Classification, 2015, ACM Multimedia.
[72] Kaiming He, et al. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.
[73] Matthieu Guillaumin, et al. Food-101 - Mining Discriminative Components with Random Forests, 2014, ECCV.
[74] Andrew Zisserman, et al. Two-Stream Convolutional Networks for Action Recognition in Videos, 2014, NIPS.
[75] Pietro Perona, et al. Microsoft COCO: Common Objects in Context, 2014, ECCV.
[76] Stefan Carlsson, et al. CNN Features Off-the-Shelf: An Astounding Baseline for Recognition, 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition Workshops.
[77] Trevor Darrell, et al. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation, 2013, 2014 IEEE Conference on Computer Vision and Pattern Recognition.
[78] Trevor Darrell, et al. DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition, 2013, ICML.
[79] Nitish Srivastava, et al. Dropout: a simple way to prevent neural networks from overfitting, 2014, J. Mach. Learn. Res.
[80] Pascal Vincent, et al. Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion, 2010, J. Mach. Learn. Res.
[81] Fei-Fei Li, et al. ImageNet: A large-scale hierarchical image database, 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.
[82] George A. Miller, et al. WordNet: A Lexical Database for English, 1995, HLT.
[83] Boris Polyak, et al. Acceleration of stochastic approximation by averaging, 1992.