Masked Vision-language Transformer in Fashion

[1] Heng Tao Shen, et al. Modality-Invariant Asymmetric Networks for Cross-Modal Hashing, 2023, IEEE Transactions on Knowledge and Data Engineering.

[2] Min Wang, et al. Cross-Modal Retrieval with Heterogeneous Graph Embedding, 2022, ACM Multimedia.

[3] Richang Hong, et al. Multi-scale Spatial Representation Learning via Recursive Hermite Polynomial Networks, 2022, IJCAI.

[4] P. Natarajan, et al. FashionVLP: Vision Language Transformer for Fashion Retrieval with Feedback, 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[5] Lingqiao Liu, et al. Verbal-Person Nets: Pose-Guided Multi-Granularity Language-to-Person Generation, 2022, IEEE Transactions on Neural Networks and Learning Systems.

[6] Ross B. Girshick, et al. Masked Autoencoders Are Scalable Vision Learners, 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[7] Xipeng Qiu, et al. Paradigm Shift in Natural Language Processing, 2021, Machine Intelligence Research.

[8] Tarik Arici, et al. MLIM: Vision-and-Language Model Pre-training with Masked Language and Image Modeling, 2021, arXiv.

[9] Wei Wang, et al. Grid-VLP: Revisiting Grid Features for Vision-Language Pre-training, 2021, arXiv.

[10] Jie Tang, et al. M6: Multi-Modality-to-Multi-Modality Multitask Mega-transformer for Unified Pretraining, 2021, KDD.

[11] Alec Radford, et al. Evaluating CLIP: Towards Characterization of Broader Capabilities and Downstream Implications, 2021, arXiv.

[12] Li Dong, et al. BEiT: BERT Pre-Training of Image Transformers, 2021, ICLR.

[13] Songfang Huang, et al. E2E-VLP: End-to-End Vision-Language Pre-training Enhanced by Visual Learning, 2021, ACL.

[14] Shih-Fu Chang, et al. VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text, 2021, NeurIPS.

[15] Jianlong Fu, et al. Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning, 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[16] Ling Shao, et al. Kaleido-BERT: Vision-Language Pre-training on Fashion Domain, 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[17] Xianyan Jia, et al. M6: A Chinese Multimodal Pretrainer, 2021, arXiv.

[18] Ilya Sutskever, et al. Learning Transferable Visual Models From Natural Language Supervision, 2021, ICML.

[19] Alec Radford, et al. Zero-Shot Text-to-Image Generation, 2021, ICML.

[20] Xiang Li, et al. Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions, 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[21] Zhe Gan, et al. Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling, 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[22] Wonjae Kim, et al. ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision, 2021, ICML.

[23] Shih-Fu Chang, et al. VX2TEXT: End-to-End Learning of Video-Based Text Generation From Multimodal Inputs, 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[24] L. Shao, et al. Salient Object Detection via Integrity Learning, 2021, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[25] S. Gelly, et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, 2020, ICLR.

[26] Chi-Hao Wu, et al. Fashion Captioning: Towards Generating Accurate Descriptions with Semantic Rewards, 2020, ECCV.

[27] Mark Chen, et al. Generative Pretraining From Pixels, 2020, ICML.

[28] Hao Wang, et al. FashionBERT: Text and Image Matching with Adaptive Loss for Cross-modal Retrieval, 2020, SIGIR.

[29] Jianfeng Gao, et al. Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks, 2020, ECCV.

[30] Kristen Grauman, et al. From Paris to Berlin: Discovering Fashion Style Influences Around the World, 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[31] Jianlong Fu, et al. Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers, 2020, arXiv.

[32] Lin Su, et al. ImageBERT: Cross-modal Pre-training with Large-scale Weak-supervised Image-Text Data, 2020, arXiv.

[33] Yu Cheng, et al. UNITER: UNiversal Image-TExt Representation Learning, 2019, ECCV.

[34] Li Wei, et al. Sampling-bias-corrected neural modeling for large corpus item recommendations, 2019, RecSys.

[35] Furu Wei, et al. VL-BERT: Pre-training of Generic Visual-Linguistic Representations, 2019, ICLR.

[36] Mohit Bansal, et al. LXMERT: Learning Cross-Modality Encoder Representations from Transformers, 2019, EMNLP.

[37] Nan Duan, et al. Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training, 2019, AAAI.

[38] Stefan Lee, et al. ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks, 2019, NeurIPS.

[39] David Reitter, et al. Fusion of Detected Objects in Text for Visual Question Answering, 2019, EMNLP.

[40] Xueming Qian, et al. Position Focused Attention Network for Image-Text Matching, 2019, IJCAI.

[41] Steven J. Rennie, et al. Fashion IQ: A New Dataset Towards Retrieving Images by Natural Language Feedback, 2019, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[42] Isay Katsman, et al. Fashion++: Minimal Edits for Outfit Improvement, 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[43] Ying Zhang, et al. Fashion-Gen: The Generative Fashion Dataset and Challenge, 2018, arXiv.

[44] David A. Forsyth, et al. Learning Type-Aware Embeddings for Fashion Compatibility, 2018, ECCV.

[45] Xi Chen, et al. Stacked Cross Attention for Image-Text Matching, 2018, ECCV.

[46] Gang Hua, et al. Hierarchical Multimodal LSTM for Dense Visual-Semantic Embedding, 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[47] David J. Fleet, et al. VSE++: Improving Visual-Semantic Embeddings with Hard Negatives, 2017, BMVC.

[48] Lukasz Kaiser, et al. Attention is All you Need, 2017, NIPS.

[49] Jian Sun, et al. Deep Residual Learning for Image Recognition, 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[50] Kaiming He, et al. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[51] Thomas Brox, et al. U-Net: Convolutional Networks for Biomedical Image Segmentation, 2015, MICCAI.

[52] Yoshua Bengio, et al. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention, 2015, ICML.

[53] Ruslan Salakhutdinov, et al. Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models, 2014, arXiv.

[54] Pietro Perona, et al. Microsoft COCO: Common Objects in Context, 2014, ECCV.

[55] Fei-Fei Li, et al. ImageNet: A large-scale hierarchical image database, 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[56] Mingchen Zhuge, et al. Skating-Mixer: Multimodal MLP for Scoring Figure Skating, 2022.

[57] Stephen Lin, et al. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows, 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[58] Ming-Wei Chang, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.