Uni-Perceiver-MoE: Learning Sparse Generalist Models with Conditional MoEs
[1] Oriol Vinyals,et al. Flamingo: a Visual Language Model for Few-Shot Learning , 2022, ArXiv.
[2] Jingren Zhou,et al. OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework , 2022, ICML.
[3] S. Hoi,et al. BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation , 2022, ICML.
[4] A. Schwing,et al. Masked-attention Mask Transformer for Universal Image Segmentation , 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[5] Xizhou Zhu,et al. Uni-Perceiver: Pre-training Unified Architecture for Generic Perception for Zero-shot and Few-shot Tasks , 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[6] Anima Anandkumar,et al. Panoptic SegFormer: Delving Deeper into Panoptic Segmentation with Transformers , 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[7] Adams Wei Yu,et al. SimVLM: Simple Visual Language Model Pretraining with Weak Supervision , 2021, ICLR.
[8] Kurt Keutzer,et al. How Much Can CLIP Benefit Vision-and-Language Tasks? , 2021, ICLR.
[9] Zhilin Yang,et al. P-Tuning v2: Prompt Tuning Can Be Comparable to Fine-tuning Universally Across Scales and Tasks , 2021, ArXiv.
[10] Ankur Bapna,et al. Beyond Distillation: Task-level Mixture-of-Experts for Efficient Inference , 2021, EMNLP.
[11] Alexandre Muzio,et al. Scalable and Efficient MoE Training for Multitask Multilingual Models , 2021, ArXiv.
[12] Jakob Uszkoreit,et al. How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers , 2021, Trans. Mach. Learn. Res..
[13] Carlos Riquelme,et al. Scaling Vision with Sparse Mixture of Experts , 2021, NeurIPS.
[14] Mingxuan Wang,et al. Learning Language Specific Sub-network for Multilingual Machine Translation , 2021, ACL.
[15] Ankur Bapna,et al. Share or Not? Learning to Schedule Language-Specific Capacity for Multilingual Translation , 2021, ICLR.
[16] Shih-Fu Chang,et al. VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text , 2021, NeurIPS.
[17] Akiko Aizawa,et al. Effect of Visual Extensions on Natural Language Understanding in Vision-and-Language Models , 2021, EMNLP.
[18] Cordelia Schmid,et al. ViViT: A Video Vision Transformer , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).
[19] Andrew Zisserman,et al. Perceiver: General Perception with Iterative Attention , 2021, ICML.
[20] Ilya Sutskever,et al. Learning Transferable Visual Models From Natural Language Supervision , 2021, ICML.
[21] Xiang Li,et al. Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).
[22] Ronghang Hu,et al. UniT: Multimodal Multitask Learning with a Unified Transformer , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).
[23] Radu Soricut,et al. Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[24] Quoc V. Le,et al. Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision , 2021, ICML.
[25] Heng Wang,et al. Is Space-Time Attention All You Need for Video Understanding? , 2021, ICML.
[26] Wonjae Kim,et al. ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision , 2021, ICML.
[27] Noam M. Shazeer,et al. Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity , 2021, J. Mach. Learn. Res..
[28] Matthieu Cord,et al. Training data-efficient image transformers & distillation through attention , 2020, ICML.
[29] Zhe Gan,et al. Crossing the Format Boundary of Text and Boxes: Towards Unified Vision-Language Modeling , 2021, ArXiv.
[30] Stephen Lin,et al. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).
[31] Yulia Tsvetkov,et al. On Negative Interference in Multilingual Language Models , 2020, EMNLP.
[32] Olatunji Ruwase,et al. DeepSpeed: System Optimizations Enable Training Deep Learning Models with Over 100 Billion Parameters , 2020, KDD.
[33] Luc Van Gool,et al. Reparameterizing Convolutions for Incremental Multi-Task Learning without Task Interference , 2020, ECCV.
[34] Yiannis Andreopoulos,et al. Biased Mixtures of Experts: Enabling Computer Vision Inference Under Data Transfer Limitations , 2020, IEEE Transactions on Image Processing.
[35] Yulia Tsvetkov,et al. Balancing Training for Multilingual Neural Machine Translation , 2020, ACL.
[36] Jianfeng Gao,et al. Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks , 2020, ECCV.
[37] Bing Li,et al. Object Relational Graph With Teacher-Recommended Learning for Video Captioning , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[38] Lin Su,et al. ImageBERT: Cross-modal Pre-training with Large-scale Weak-supervised Image-Text Data , 2020, ArXiv.
[39] S. Levine,et al. Gradient Surgery for Multi-Task Learning , 2020, NeurIPS.
[40] Marcus Rohrbach,et al. 12-in-1: Multi-Task Vision and Language Representation Learning , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[41] Jianfeng Gao,et al. SMART: Robust and Efficient Fine-Tuning for Pre-trained Natural Language Models through Principled Regularized Optimization , 2019, ACL.
[42] Yu Cheng,et al. UNITER: UNiversal Image-TExt Representation Learning , 2019, ECCV.
[43] Jason J. Corso,et al. Unified Vision-Language Pre-Training for Image Captioning and VQA , 2019, AAAI.
[44] Jitendra Malik,et al. Which Tasks Should Be Learned Together in Multi-task Learning? , 2019, ICML.
[45] Ruslan Salakhutdinov,et al. Embodied Multimodal Multitask Learning , 2019, IJCAI.
[46] Bolei Zhou,et al. Moments in Time Dataset: One Million Videos for Event Understanding , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.
[47] Ross B. Girshick,et al. Mask R-CNN , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).
[48] Chris Hokamp,et al. Evaluating the Supervised and Zero-shot Performance of Multi-lingual Translation Models , 2019, WMT.
[49] Quoc V. Le,et al. CondConv: Conditionally Parameterized Convolutions for Efficient Inference , 2019, NeurIPS.
[50] Trevor Darrell,et al. Deep Mixture of Experts via Shallow Embedding , 2018, UAI.
[51] Dustin Tran,et al. Mesh-TensorFlow: Deep Learning for Supercomputers , 2018, NeurIPS.
[52] Li Fei-Fei,et al. Dynamic Task Prioritization for Multitask Learning , 2018, ECCV.
[53] Radu Soricut,et al. Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning , 2018, ACL.
[54] Zhao Chen,et al. GradNorm: Gradient Normalization for Adaptive Loss Balancing in Deep Multitask Networks , 2017, ICML.
[55] Roberto Cipolla,et al. Multi-task Learning Using Uncertainty to Weigh Losses for Scene Geometry and Semantics , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[56] Lukasz Kaiser,et al. Attention is All you Need , 2017, NIPS.
[57] Xuanjing Huang,et al. Adversarial Multi-task Learning for Text Classification , 2017, ACL.
[58] Geoffrey E. Hinton,et al. Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer , 2017, ICLR.
[59] Michael S. Bernstein,et al. Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations , 2016, International Journal of Computer Vision.
[60] Jian Sun,et al. Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[61] Andreas Dengel,et al. Real-time Analysis and Visualization of the YFCC100m Dataset , 2015, MMCommons '15.
[62] Sanja Fidler,et al. Aligning Books and Movies: Towards Story-Like Visual Explanations by Watching Movies and Reading Books , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).
[63] Kaiming He,et al. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.
[64] Svetlana Lazebnik,et al. Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models , 2015, International Journal of Computer Vision.
[65] Xinlei Chen,et al. Microsoft COCO Captions: Data Collection and Evaluation Server , 2015, ArXiv.
[66] Andrew Zisserman,et al. Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.
[67] Vicente Ordonez,et al. Im2Text: Describing Images Using 1 Million Captioned Photographs , 2011, NIPS.
[68] William B. Dolan,et al. Collecting Highly Parallel Data for Paraphrase Evaluation , 2011, ACL.
[69] Fei-Fei Li,et al. ImageNet: A large-scale hierarchical image database , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.