Masked Vision-language Transformer in Fashion
[1] Heng Tao Shen, et al. Modality-Invariant Asymmetric Networks for Cross-Modal Hashing, 2023, IEEE Transactions on Knowledge and Data Engineering.
[2] Min Wang, et al. Cross-Modal Retrieval with Heterogeneous Graph Embedding, 2022, ACM Multimedia.
[3] Richang Hong, et al. Multi-scale Spatial Representation Learning via Recursive Hermite Polynomial Networks, 2022, IJCAI.
[4] P. Natarajan, et al. FashionVLP: Vision Language Transformer for Fashion Retrieval with Feedback, 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[5] Lingqiao Liu, et al. Verbal-Person Nets: Pose-Guided Multi-Granularity Language-to-Person Generation, 2022, IEEE Transactions on Neural Networks and Learning Systems.
[6] Ross B. Girshick, et al. Masked Autoencoders Are Scalable Vision Learners, 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[7] Xipeng Qiu, et al. Paradigm Shift in Natural Language Processing, 2021, Machine Intelligence Research.
[8] Tarik Arici, et al. MLIM: Vision-and-Language Model Pre-training with Masked Language and Image Modeling, 2021, ArXiv.
[9] Wei Wang, et al. Grid-VLP: Revisiting Grid Features for Vision-Language Pre-training, 2021, ArXiv.
[10] Jie Tang, et al. M6: Multi-Modality-to-Multi-Modality Multitask Mega-transformer for Unified Pretraining, 2021, KDD.
[11] Alec Radford, et al. Evaluating CLIP: Towards Characterization of Broader Capabilities and Downstream Implications, 2021, ArXiv.
[12] Li Dong, et al. BEiT: BERT Pre-Training of Image Transformers, 2021, ICLR.
[13] Songfang Huang, et al. E2E-VLP: End-to-End Vision-Language Pre-training Enhanced by Visual Learning, 2021, ACL.
[14] Shih-Fu Chang, et al. VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text, 2021, NeurIPS.
[15] Jianlong Fu, et al. Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning, 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[16] Ling Shao, et al. Kaleido-BERT: Vision-Language Pre-training on Fashion Domain, 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[17] Xianyan Jia, et al. M6: A Chinese Multimodal Pretrainer, 2021, ArXiv.
[18] Ilya Sutskever, et al. Learning Transferable Visual Models From Natural Language Supervision, 2021, ICML.
[19] Alec Radford, et al. Zero-Shot Text-to-Image Generation, 2021, ICML.
[20] Xiang Li, et al. Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions, 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).
[21] Zhe Gan, et al. Less is More: CLIPBERT for Video-and-Language Learning via Sparse Sampling, 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[22] Wonjae Kim, et al. ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision, 2021, ICML.
[23] Shih-Fu Chang, et al. VX2TEXT: End-to-End Learning of Video-Based Text Generation From Multimodal Inputs, 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[24] L. Shao, et al. Salient Object Detection via Integrity Learning, 2021, IEEE Transactions on Pattern Analysis and Machine Intelligence.
[25] S. Gelly, et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, 2020, ICLR.
[26] Chi-Hao Wu, et al. Fashion Captioning: Towards Generating Accurate Descriptions with Semantic Rewards, 2020, ECCV.
[27] Mark Chen, et al. Generative Pretraining From Pixels, 2020, ICML.
[28] Hao Wang, et al. FashionBERT: Text and Image Matching with Adaptive Loss for Cross-modal Retrieval, 2020, SIGIR.
[29] Jianfeng Gao, et al. Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks, 2020, ECCV.
[30] Kristen Grauman, et al. From Paris to Berlin: Discovering Fashion Style Influences Around the World, 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[31] Jianlong Fu, et al. Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers, 2020, ArXiv.
[32] Lin Su, et al. ImageBERT: Cross-modal Pre-training with Large-scale Weak-supervised Image-Text Data, 2020, ArXiv.
[33] Yu Cheng, et al. UNITER: UNiversal Image-TExt Representation Learning, 2019, ECCV.
[34] Li Wei, et al. Sampling-bias-corrected neural modeling for large corpus item recommendations, 2019, RecSys.
[35] Furu Wei, et al. VL-BERT: Pre-training of Generic Visual-Linguistic Representations, 2019, ICLR.
[36] Mohit Bansal, et al. LXMERT: Learning Cross-Modality Encoder Representations from Transformers, 2019, EMNLP.
[37] Nan Duan, et al. Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training, 2019, AAAI.
[38] Stefan Lee, et al. ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks, 2019, NeurIPS.
[39] David Reitter, et al. Fusion of Detected Objects in Text for Visual Question Answering, 2019, EMNLP.
[40] Xueming Qian, et al. Position Focused Attention Network for Image-Text Matching, 2019, IJCAI.
[41] Steven J. Rennie, et al. Fashion IQ: A New Dataset Towards Retrieving Images by Natural Language Feedback, 2019, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[42] Isay Katsman, et al. Fashion++: Minimal Edits for Outfit Improvement, 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[43] Ying Zhang, et al. Fashion-Gen: The Generative Fashion Dataset and Challenge, 2018, ArXiv.
[44] David A. Forsyth, et al. Learning Type-Aware Embeddings for Fashion Compatibility, 2018, ECCV.
[45] Xi Chen, et al. Stacked Cross Attention for Image-Text Matching, 2018, ECCV.
[46] Gang Hua, et al. Hierarchical Multimodal LSTM for Dense Visual-Semantic Embedding, 2017, 2017 IEEE International Conference on Computer Vision (ICCV).
[47] David J. Fleet, et al. VSE++: Improving Visual-Semantic Embeddings with Hard Negatives, 2017, BMVC.
[48] Lukasz Kaiser, et al. Attention is All you Need, 2017, NIPS.
[49] Jian Sun, et al. Deep Residual Learning for Image Recognition, 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[50] Kaiming He, et al. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.
[51] Thomas Brox, et al. U-Net: Convolutional Networks for Biomedical Image Segmentation, 2015, MICCAI.
[52] Yoshua Bengio, et al. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention, 2015, ICML.
[53] Ruslan Salakhutdinov, et al. Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models, 2014, ArXiv.
[54] Pietro Perona, et al. Microsoft COCO: Common Objects in Context, 2014, ECCV.
[55] Fei-Fei Li, et al. ImageNet: A large-scale hierarchical image database, 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.
[56] Mingchen Zhuge, et al. Skating-Mixer: Multimodal MLP for Scoring Figure Skating, 2022.
[57] Stephen Lin, et al. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows, 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).
[58] Ming-Wei Chang, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.