Paper Information - Real20M: A Large-scale E-commerce Dataset for Cross-domain Retrieval

Real20M: A Large-scale E-commerce Dataset for Cross-domain Retrieval

In e-commerce, products and micro-videos serve as two primary carriers. Introducing cross-domain retrieval between these carriers can establish associations, thereby leading to the advancement of specific scenarios, such as retrieving products based on micro-videos or recommending relevant videos based on products. However, existing datasets only focus on retrieval within the product domain while neglecting the micro-video domain and often ignore the multi-modal characteristics of the product domain. Additionally, these datasets strictly limit their data scale through content alignment and use a content-based data organization format that hinders the inclusion of user retrieval intentions. To address these limitations, we propose the PKU Real20M dataset, a large-scale e-commerce dataset designed for cross-domain retrieval. We adopt a query-driven approach to efficiently gather over 20 million e-commerce products and micro-videos, including multimodal information. Additionally, we design a three-level entity prompt learning framework to align inter-modality information from coarse to fine. Moreover, we introduce the Query-driven Cross-Domain retrieval framework (QCD), which leverages user queries to facilitate efficient alignment between the product and micro-video domains. Extensive experiments on two downstream tasks validate the effectiveness of our proposed approaches. The dataset and source code are available at https://github.com/PKU-ICST-MIPL/Real20M_ACMMM2023.

Yuxin Peng | Xiangteng He | Lele Cheng | Yanzhe Chen | Huasong Zhong
