ELEVATER: A Benchmark and Toolkit for Evaluating Language-Augmented Visual Models

Learning visual representations from natural language supervision has recently shown great promise in a number of pioneering works. In general, these language-augmented visual models demonstrate strong transferability to a variety of datasets and tasks. However, evaluating the transferability of these foundation models remains a challenge due to the lack of easy-to-use toolkits for fair benchmarking. To tackle this, we build ELEVATER, the first benchmark for comparing and evaluating pre-trained language-augmented visual models. Several highlights include: (i) Datasets. As downstream evaluation suites, it consists of 20 image classification datasets and 35 object detection datasets, each augmented with external knowledge. (ii) Toolkit. An automatic hyper-parameter tuning toolkit is developed to ensure fairness in model adaptation. To leverage the full power of language-augmented visual models, novel language-aware initialization methods are proposed that significantly improve adaptation performance. (iii) Metrics. A variety of evaluation metrics are used, covering both sample efficiency (zero-shot and few-shot) and parameter efficiency (linear probing and full-model fine-tuning). We will release our toolkit and evaluation platforms for the research community.
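
As a rough illustration of the language-aware initialization idea mentioned above, the sketch below initializes a linear classification head from text embeddings of the class names, so that zero-shot prediction and subsequent linear probing start from the language prior rather than from random weights. This is a minimal sketch in Python assuming the OpenAI `clip` package; the class names, prompt template, and variable names are illustrative assumptions, not the ELEVATER toolkit's actual implementation.

# Hypothetical sketch: language-aware initialization of a linear-probe head.
# Assumes the OpenAI `clip` package; details differ from the ELEVATER toolkit.
import torch
import torch.nn as nn
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

class_names = ["golden retriever", "tabby cat", "school bus"]  # illustrative classes
prompts = [f"a photo of a {c}" for c in class_names]

with torch.no_grad():
    # Encode one prompt per class and L2-normalize the embeddings.
    text_features = model.encode_text(clip.tokenize(prompts).to(device))
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)

# Linear head whose weights start at the class text embeddings instead of random values.
head = nn.Linear(text_features.shape[1], len(class_names), bias=False).to(device)
head.weight.data.copy_(text_features.float())

# Zero-shot prediction applies this head to normalized image features; few-shot and
# linear-probe settings then continue training `head` (and optionally the encoder).

The design intuition is that the text encoder already places class names in the same embedding space as images, so the initialized head matches zero-shot performance at step zero and fine-tuning can only improve from there, which is especially helpful in low-shot regimes.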
