Multimodal Contrastive Learning with LIMoE: the Language-Image Mixture of Experts
[1] Zirui Wang,et al. CoCa: Contrastive Captioners are Image-Text Foundation Models , 2022, Trans. Mach. Learn. Res..
[2] Ari S. Morcos,et al. Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time , 2022, ICML.
[3] J. Dean,et al. ST-MoE: Designing Stable and Transferable Sparse Expert Models , 2022, ArXiv.
[4] Kishaan Jeeveswaran,et al. A Comprehensive Study of Vision Transformers on Dense Prediction Tasks , 2022, VISIGRAPP.
[5] Ming-Hsuan Yang,et al. Towards a Unified Foundation Model: Jointly Pre-Training Transformers on Unpaired Images and Text , 2021, ArXiv.
[6] Krzysztof Choromanski,et al. PolyViT: Co-training Vision Transformers on Images, Videos and Audio , 2021, Trans. Mach. Learn. Res..
[7] Lu Yuan,et al. Florence: A New Foundation Model for Computer Vision , 2021, ArXiv.
[8] Zhe Gan,et al. UFO: A UniFied TransfOrmer for Vision-Language Representation Learning , 2021, ArXiv.
[9] Daniel Keysers,et al. LiT: Zero-Shot Transfer with Locked-image text Tuning , 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[10] Jenia Jitsev,et al. LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs , 2021, ArXiv.
[11] Vinay Uday Prabhu,et al. Multimodal datasets: misogyny, pornography, and malignant stereotypes , 2021, ArXiv.
[12] Zangwei Zheng,et al. Cross-token Modeling with Conditional Computation , 2021, ArXiv.
[13] Michael S. Bernstein,et al. On the Opportunities and Risks of Foundation Models , 2021, ArXiv.
[14] Carlos Riquelme,et al. Scaling Vision with Sparse Mixture of Experts , 2021, NeurIPS.
[15] Alexander Kolesnikov,et al. Scaling Vision Transformers , 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[16] Jason Weston,et al. Hash Layers For Large Sparse Models , 2021, NeurIPS.
[17] Aakanksha Chowdhery,et al. DSelect-k: Differentiable Selection in the Mixture of Experts with Applications to Multi-Task Learning , 2021, NeurIPS.
[18] Shih-Fu Chang,et al. VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text , 2021, NeurIPS.
[19] Chen Liang,et al. Carbon Emissions and Large Neural Network Training , 2021, ArXiv.
[20] Naman Goyal,et al. BASE Layers: Simplifying Training of Large, Sparse Models , 2021, ICML.
[21] Ilya Sutskever,et al. Learning Transferable Visual Models From Natural Language Supervision , 2021, ICML.
[22] Quoc V. Le,et al. Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision , 2021, ICML.
[23] Noam M. Shazeer,et al. Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity , 2021, J. Mach. Learn. Res..
[24] Emily Denton,et al. Characterising Bias in Compressed Models , 2020, ArXiv.
[25] Christopher D. Manning,et al. Contrastive Learning of Medical Visual Representations from Paired Images and Text , 2020, MLHC.
[26] Yi Tay,et al. Efficient Transformers: A Survey , 2020, ACM Comput. Surv..
[27] Mark Collier,et al. Routing Networks with Co-training for Continual Learning , 2020, ArXiv.
[28] Orhan Firat,et al. GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding , 2020, ICLR.
[29] Ce Liu,et al. Supervised Contrastive Learning , 2020, NeurIPS.
[30] Geoffrey E. Hinton,et al. A Simple Framework for Contrastive Learning of Visual Representations , 2020, ICML.
[31] Peter J. Liu,et al. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer , 2019, J. Mach. Learn. Res..
[32] Jifeng Dai,et al. VL-BERT: Pre-training of Generic Visual-Linguistic Representations , 2019, ICLR.
[33] Mohit Bansal,et al. LXMERT: Learning Cross-Modality Encoder Representations from Transformers , 2019, EMNLP.
[34] Cho-Jui Hsieh,et al. VisualBERT: A Simple and Performant Baseline for Vision and Language , 2019, ArXiv.
[35] Stefan Lee,et al. ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks , 2019, NeurIPS.
[36] Matthijs Douze,et al. Fixing the train-test resolution discrepancy , 2019, NeurIPS.
[37] Taku Kudo,et al. SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing , 2018, EMNLP.
[38] Zhe Zhao,et al. Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture-of-Experts , 2018, KDD.
[39] Lukasz Kaiser,et al. Attention is All you Need , 2017, NIPS.
[40] Geoffrey E. Hinton,et al. Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer , 2017, ICLR.
[41] Pietro Perona,et al. Microsoft COCO: Common Objects in Context , 2014, ECCV.
[43] Andrew Y. Ng,et al. Zero-Shot Learning Through Cross-Modal Transfer , 2013, NIPS.
[44] Fei-Fei Li,et al. ImageNet: A large-scale hierarchical image database , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.
[46] Thomas M. Cover,et al. Elements of Information Theory , 2005.
[47] Quoc V. Le,et al. Combined Scaling for Open-Vocabulary Image Classification , 2022.
[48] Stephen Lin,et al. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).
[49] Ming-Wei Chang,et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding , 2019, NAACL.
[50] Steven Bird. NLTK: The Natural Language Toolkit , 2006, ACL.