FashionKLIP: Enhancing E-Commerce Image-Text Retrieval with Fashion Multi-Modal Conceptual Knowledge Graph

Image-text retrieval is a core task in the multi-modal domain and has attracted considerable attention from both the research and industry communities. Recently, the rise of vision-language pre-trained (VLP) models has greatly enhanced the performance of cross-modal retrieval. However, the fine-grained interactions between objects from different modalities remain far from well established. This issue becomes more severe in the e-commerce domain, which lacks sufficient training data and fine-grained cross-modal knowledge. To alleviate this problem, this paper proposes FashionKLIP, a novel knowledge-enhanced VLP model for e-commerce. We first automatically construct a multi-modal conceptual knowledge graph from large-scale e-commerce image-text data, and then inject this prior knowledge into the VLP model to align the two modalities at the conceptual level. Experiments on a public benchmark dataset demonstrate that FashionKLIP improves e-commerce image-text retrieval over state-of-the-art VLP models by a large margin. Application of the method in real industrial scenarios further demonstrates the feasibility and efficiency of FashionKLIP.

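To make the idea of concept-level knowledge injection concrete, the sketch below shows one plausible way such an objective could be wired into a CLIP-style dual-encoder: the usual image-caption contrastive loss is combined with an additional term that pulls image embeddings toward embeddings of fashion concept phrases linked to them in a conceptual knowledge graph. This is a minimal illustrative sketch under our own assumptions; the function names, the loss form, and the weighting are not FashionKLIP's actual architecture or training objective.

```python
# Hypothetical sketch of concept-level alignment on top of a CLIP-style dual
# encoder. All names and the exact loss form are illustrative assumptions.
import torch
import torch.nn.functional as F


def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Standard symmetric InfoNCE over matched image-caption pairs in a batch."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2


def concept_alignment_loss(image_emb, concept_emb, temperature=0.07):
    """Assumed concept-level term: pull each image toward the embedding of a
    fashion concept phrase (e.g. "denim jacket") linked to it in the knowledge
    graph; here one concept is sampled per image per step."""
    image_emb = F.normalize(image_emb, dim=-1)
    concept_emb = F.normalize(concept_emb, dim=-1)
    logits = image_emb @ concept_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)


def training_step(model, batch, lambda_concept=0.5):
    """One optimization step combining caption-level and concept-level alignment.
    `model` is assumed to expose encode_image / encode_text; `lambda_concept`
    is an illustrative weighting, not a reported hyperparameter."""
    img = model.encode_image(batch["images"])
    txt = model.encode_text(batch["captions"])
    cpt = model.encode_text(batch["concepts"])  # concept phrases mined from the KG
    return clip_contrastive_loss(img, txt) + lambda_concept * concept_alignment_loss(img, cpt)
```

In such a setup, the knowledge graph's role at training time is simply to supply, for each image, a set of concept phrases that serve as extra positive text views, so that alignment is supervised at the conceptual level in addition to the full-caption level.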