Multimodal Contrastive Learning with LIMoE: the Language-Image Mixture of Experts

Large sparsely-activated models have obtained excellent performance in multiple domains. However, such models are typically trained on a single modality at a time. We present the Language-Image MoE, LIMoE, a sparse mixture of experts model capable of multimodal learning. LIMoE accepts both images and text simultaneously, and is trained using a contrastive loss. MoEs are a natural fit for a multimodal backbone, since expert layers can learn an appropriate partitioning of modalities. However, new challenges arise, in particular around training stability and balanced expert utilization, for which we propose an entropy-based regularization scheme. Across multiple scales, we demonstrate remarkable performance improvements over dense models of equivalent computational cost. LIMoE-L/16, trained comparably to CLIP-L/14, achieves 78.6% zero-shot ImageNet accuracy (vs. 76.2% for CLIP), and when further scaled to H/14 (with additional data) it achieves 84.1%, comparable to state-of-the-art methods which use larger custom per-modality backbones and pre-training schemes. We analyse the quantitative and qualitative behavior of LIMoE, and demonstrate phenomena such as differing treatment of the modalities and the organic emergence of modality-specific experts.
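As a concrete picture of what "sparsely-activated" means here, the sketch below implements a minimal MoE layer with top-k routing in NumPy. All names (`moe_layer`, `router_w`, `expert_ws`) are illustrative assumptions, not the paper's code; real experts are MLPs and real implementations batch the dispatch, but the routing logic is the same idea: each token is processed by only a few experts, selected and weighted by a learned router.

```python
import numpy as np
from scipy.special import softmax

def moe_layer(tokens, router_w, expert_ws, k=1):
    """Minimal sketch of a sparsely-activated MoE layer with top-k routing.

    tokens:    (num_tokens, dim) token representations; in LIMoE, image
               patches and text tokens pass through the same layers.
    router_w:  (dim, num_experts) learned routing weights.
    expert_ws: list of (dim, dim) matrices, one per expert; a single
               matrix stands in for each expert's MLP to keep this short.
    """
    gates = softmax(tokens @ router_w, axis=-1)     # (tokens, experts)
    topk = np.argsort(-gates, axis=-1)[:, :k]       # top-k expert ids per token
    out = np.zeros_like(tokens)
    for t in range(tokens.shape[0]):
        weights = gates[t, topk[t]]
        weights = weights / weights.sum()           # renormalize over selected experts
        for e, w in zip(topk[t], weights):
            out[t] += w * (tokens[t] @ expert_ws[e])
    return out
```

Because only k experts run per token, total parameter count grows with the number of experts while per-token compute stays roughly constant, which is what makes the comparison against dense models "of equivalent computational cost" meaningful.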
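The contrastive training objective is the standard CLIP-style symmetric loss over paired image and text embeddings; a minimal sketch follows. The fixed `temperature` default is a placeholder assumption (in practice it is typically a learned parameter), and the embeddings are assumed to be already L2-normalized.

```python
import numpy as np
from scipy.special import logsumexp

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric image-text contrastive loss over a batch of pairs.

    img_emb, txt_emb: (batch, dim) L2-normalized embeddings; row i of each
    comes from the same image-text pair, so matching pairs sit on the
    diagonal of the similarity matrix.
    """
    logits = img_emb @ txt_emb.T / temperature            # (batch, batch)
    idx = np.arange(logits.shape[0])
    # Cross-entropy in both directions: image->text (rows) and text->image (columns).
    log_p_i2t = logits - logsumexp(logits, axis=1, keepdims=True)
    log_p_t2i = logits - logsumexp(logits, axis=0, keepdims=True)
    return -0.5 * (log_p_i2t[idx, idx].mean() + log_p_t2i[idx, idx].mean())
```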
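The entropy-based regularization the abstract mentions can be read as two complementary pressures on the router's distribution over experts, computed per modality: each individual token should route confidently (low per-token entropy), while the modality as a whole should spread its tokens across experts (high entropy of the averaged routing distribution). The sketch below encodes that reading; the weights are illustrative, and the paper's exact formulation (e.g., thresholds or per-modality weighting) may differ.

```python
import numpy as np
from scipy.special import softmax

def entropy(p, axis=-1, eps=1e-9):
    """Shannon entropy of distributions along `axis`."""
    return -np.sum(p * np.log(p + eps), axis=axis)

def entropy_regularizer(router_logits, local_weight=0.01, global_weight=0.01):
    """Hedged sketch of an entropy-based router regularizer for one modality.

    router_logits: (num_tokens, num_experts) pre-softmax routing scores for
    all tokens of a single modality (image or text) in the batch.
    """
    probs = softmax(router_logits, axis=-1)         # (tokens, experts)
    local_term = entropy(probs).mean()              # minimized: confident per-token routing
    global_term = -entropy(probs.mean(axis=0))      # minimized: balanced expert usage overall
    return local_weight * local_term + global_weight * global_term
```

Adding such a term for each modality discourages the failure mode where, say, all text tokens collapse onto a single expert, without forcing every individual token to hedge across experts.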
