EI-CLIP: Entity-aware Interventional Contrastive Learning for E-commerce Cross-modal Retrieval
Zhangyang Wang | Xiaohui Xie | Jiuxiang Gu | Tong Yu | Handong Zhao | Sunav Choudhary | Haoyu Ma | Zhe Lin | Ajinkya Kale