EI-CLIP: Entity-aware Interventional Contrastive Learning for E-commerce Cross-modal Retrieval

Cross-modal language-image retrieval in E-commerce is a fundamental problem for product search, recommendation, and marketing services. Extensive efforts have been made to tackle cross-modal retrieval in the general domain. When it comes to E-commerce, a common practice is to adopt a pretrained model and fine-tune it on E-commerce data. Despite its simplicity, this yields sub-optimal performance because it overlooks the uniqueness of E-commerce multimodal data. A few recent efforts [10], [72] have shown significant improvements over generic methods with customized designs for handling product images. Unfortunately, to the best of our knowledge, no existing method has addressed the unique challenges of E-commerce language. This work studies an outstanding one: E-commerce text contains a large collection of entities with special meanings, e.g., “Dissel (brand)”, “Top (category)”, “relaxed (fit)” in the fashion clothing business. By formulating this out-of-distribution fine-tuning process in the causal inference paradigm, we view the erroneous semantics of these special entities as confounders that cause retrieval failures. To rectify these semantics so that they align with E-commerce domain knowledge, we propose an intervention-based, entity-aware contrastive learning framework with two modules, i.e., the Confounding Entity Selection Module and the Entity-Aware Learning Module. Our method achieves competitive performance on the E-commerce benchmark Fashion-Gen. In particular, in top-1 accuracy (R@1), we observe 10.3% and 10.5% relative improvements over the closest baseline for image-to-text and text-to-image retrieval, respectively.
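The abstract does not give implementation details for the two modules, but the overall recipe — a CLIP-style symmetric contrastive objective plus an intervention that averages out confounding entity semantics — can be sketched. The snippet below is a minimal illustration, not the paper's actual method: `backdoor_adjust`, `entity_bank`, and `entity_prior` are hypothetical names, and the adjustment shown is the simplest additive form of a backdoor expectation E_z[t + z] over a dictionary of confounder embeddings.

```python
import numpy as np

def symmetric_contrastive_loss(img_emb, txt_emb, tau=0.07):
    """CLIP-style InfoNCE loss over a batch of matched image/text pairs."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / tau  # (n, n) temperature-scaled cosine similarities

    def cross_entropy(l):
        # softmax cross-entropy against the diagonal (the matched pairs)
        l = l - l.max(axis=1, keepdims=True)
        p = np.exp(l) / np.exp(l).sum(axis=1, keepdims=True)
        idx = np.arange(len(l))
        return -np.log(p[idx, idx]).mean()

    # average of image-to-text and text-to-image directions
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

def backdoor_adjust(txt_emb, entity_bank, entity_prior):
    """Hypothetical backdoor adjustment: instead of trusting the raw entity
    semantics in a text feature, take an expectation over a dictionary of
    confounding entity embeddings z weighted by a prior P(z)."""
    expected_entity = entity_prior @ entity_bank  # (d,) prior-weighted entity
    return txt_emb + expected_entity              # shift toward adjusted semantics
```

A matched batch (text features equal to image features) should score a much lower loss than a mismatched one, which is the property the contrastive objective optimizes; the adjusted text features can then be fed to the same loss in place of the raw ones.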

[1] Fenglin Liu, et al. Aligning Source Visual and Target Language Domains for Unpaired Video Captioning, 2021, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[2] Deying Kong, et al. AFTer-UNet: Axial Fusion Transformer UNet for Medical Image Segmentation, 2021, 2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV).

[3] Deying Kong, et al. TransFusion: Cross-view Fusion with Transformer for 3D Human Pose Estimation, 2021, BMVC.

[4] Tat-Seng Chua, et al. Interventional Video Relation Detection, 2021, ACM Multimedia.

[5] Chenyu You, et al. Self-supervised Contrastive Cross-Modality Representation Learning for Spoken Question Answering, 2021, EMNLP.

[6] J. Duncan, et al. SimCVD: Simple Contrastive Voxel-Wise Representation Distillation for Semi-Supervised Medical Image Segmentation, 2021, IEEE Transactions on Medical Imaging.

[7] Alec Radford, et al. Evaluating CLIP: Towards Characterization of Broader Capabilities and Downstream Implications, 2021, ArXiv.

[8] Xiao Dong, et al. Product1M: Towards Weakly Supervised Instance-Level Product Retrieval via Cross-Modal Pretraining, 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[9] Xinxiao Wu, et al. Boosting Entity-Aware Image Captioning With Multi-Modal Knowledge Graph, 2021, IEEE Transactions on Multimedia.

[10] Wenkai Zhang, et al. De-biasing Distantly Supervised Named Entity Recognition via Causal Intervention, 2021, ACL.

[11] Meng Wang, et al. Deconfounded Video Moment Retrieval with Causal Intervention, 2021, SIGIR.

[12] Ling Shao, et al. Kaleido-BERT: Vision-Language Pre-training on Fashion Domain, 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[13] Bing Deng, et al. The Blessings of Unlabeled Background in Untrimmed Videos, 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[14] Jianfei Cai, et al. Causal Attention for Vision-Language Tasks, 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[15] Ilya Sutskever, et al. Learning Transferable Visual Models From Natural Language Supervision, 2021, ICML.

[16] Quoc V. Le, et al. Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision, 2021, ICML.

[17] Francis E. H. Tay, et al. Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet, 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[18] Matthieu Cord, et al. Training data-efficient image transformers & distillation through attention, 2020, ICML.

[19] Zhangyang Wang, et al. Graph Contrastive Learning with Augmentations, 2020, NeurIPS.

[20] Hanwang Zhang, et al. Interventional Few-Shot Learning, 2020, NeurIPS.

[21] Jinhui Tang, et al. Causal Intervention for Weakly-Supervised Semantic Segmentation, 2020, NeurIPS.

[22] Zhou Zhao, et al. DeVLBert: Learning Deconfounded Visio-Linguistic Representations, 2020, ACM Multimedia.

[23] Walid Krichene, et al. On Sampled Metrics for Item Recommendation, 2020, KDD.

[24] Yang Zhang, et al. Modality-Agnostic Attention Fusion for visual search with text feedback, 2020, ArXiv.

[25] Pierre H. Richemond, et al. Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning, 2020, NeurIPS.

[26] Hanwang Zhang, et al. Visual Commonsense Representation Learning via Causal Inference, 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[27] Hao Wang, et al. FashionBERT: Text and Image Matching with Adaptive Loss for Cross-modal Retrieval, 2020, SIGIR.

[28] Lexing Xie, et al. Transform and Tell: Entity-Aware News Image Captioning, 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[29] Jianfeng Gao, et al. Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks, 2020, ECCV.

[30] Jiebo Luo, et al. Adaptive Offline Quintuplet Loss for Image-Text Matching, 2020, ECCV.

[31] Hanwang Zhang, et al. Visual Commonsense R-CNN, 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[32] Tomohide Shibata. Understand in 5 Minutes!? Skimming Famous Papers: Jacob Devlin et al.: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2020.

[33] Geoffrey E. Hinton, et al. A Simple Framework for Contrastive Learning of Visual Representations, 2020, ICML.

[34] Lin Su, et al. ImageBERT: Cross-modal Pre-training with Large-scale Weak-supervised Image-Text Data, 2020, ArXiv.

[35] Ross B. Girshick, et al. Momentum Contrast for Unsupervised Visual Representation Learning, 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[36] Jianmo Ni, et al. Justifying Recommendations using Distantly-Labeled Reviews and Fine-Grained Aspects, 2019, EMNLP.

[37] Xilin Chen, et al. Cross-modal Scene Graph Matching for Relationship-aware Image-Text Retrieval, 2019, 2020 IEEE Winter Conference on Applications of Computer Vision (WACV).

[38] Yun Fu, et al. Visual Semantic Reasoning for Image-Text Matching, 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[39] Furu Wei, et al. VL-BERT: Pre-training of Generic Visual-Linguistic Representations, 2019, ICLR.

[40] Mohit Bansal, et al. LXMERT: Learning Cross-Modality Encoder Representations from Transformers, 2019, EMNLP.

[41] Nan Duan, et al. Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training, 2019, AAAI.

[42] Stefan Lee, et al. ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks, 2019, NeurIPS.

[43] Xueming Qian, et al. Position Focused Attention Network for Image-Text Matching, 2019, IJCAI.

[44] Cordelia Schmid, et al. VideoBERT: A Joint Model for Video and Language Representation Learning, 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[45] Mark Dredze, et al. Challenges of Using Text Classifiers for Causal Inference, 2018, EMNLP.

[46] Taku Komura, et al. Mode-adaptive neural networks for quadruped motion control, 2018, ACM Transactions on Graphics.

[47] Ying Zhang, et al. Fashion-Gen: The Generative Fashion Dataset and Challenge, 2018, ArXiv.

[48] Heng Ji, et al. Entity-aware Image Caption Generation, 2018, EMNLP.

[49] Xi Chen, et al. Stacked Cross Attention for Image-Text Matching, 2018, ECCV.

[50] David J. Fleet, et al. VSE++: Improving Visual-Semantic Embeddings with Hard Negatives, 2017, BMVC.

[51] Lukasz Kaiser, et al. Attention is All you Need, 2017, NIPS.

[52] Yun Fu, et al. Multi-View Clustering via Deep Matrix Factorization, 2017, AAAI.

[53] Fei Su, et al. Deep canonical correlation analysis with progressive and hypergraph learning for cross-modal retrieval, 2016, Neurocomputing.

[54] Peter Jansen, et al. Creating Causal Embeddings for Question Answering with Minimal Supervision, 2016, EMNLP.

[55] Yun Fu, et al. Incomplete Multi-Modal Visual Data Grouping, 2016, IJCAI.

[56] Bernhard Schölkopf, et al. Discovering Causal Signals in Images, 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[57] Julian J. McAuley, et al. Ups and Downs: Modeling the Visual Evolution of Fashion Trends with One-Class Collaborative Filtering, 2016, WWW.

[58] Alexandra Birch, et al. Neural Machine Translation of Rare Words with Subword Units, 2015, ACL.

[59] Krystian Mikolajczyk, et al. Deep correlation for matching images and text, 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[60] Ross B. Girshick, et al. Fast R-CNN, 2015, ICCV.

[61] Yoshua Bengio, et al. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention, 2015, ICML.

[62] Fei-Fei Li, et al. Deep visual-semantic alignments for generating image descriptions, 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[63] Pietro Perona, et al. Visual Causal Feature Learning, 2014, UAI.

[64] Ruslan Salakhutdinov, et al. Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models, 2014, ArXiv.

[65] Marc'Aurelio Ranzato, et al. DeViSE: A Deep Visual-Semantic Embedding Model, 2013, NIPS.

[66] J. Dean, et al. Efficient Estimation of Word Representations in Vector Space, 2013, ICLR.

[67] Josef Kittler, et al. Discriminative Learning and Recognition of Image Set Classes Using Canonical Correlations, 2007, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[68] John Shawe-Taylor, et al. Canonical Correlation Analysis: An Overview with Application to Learning Methods, 2004, Neural Computation.

[69] S. Hochreiter, et al. Long Short-Term Memory, 1997, Neural Computation.

[70] Stephen Lin, et al. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows, 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[71] Ming-Wei Chang, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.

[72] J. Pearl. Causality: Models, Reasoning, and Inference, 2022.