Real20M: A Large-scale E-commerce Dataset for Cross-domain Retrieval

In e-commerce, products and micro-videos are the two primary content carriers. Cross-domain retrieval between them establishes associations that enable practical scenarios such as retrieving products from micro-videos or recommending relevant videos for products. However, existing datasets focus only on retrieval within the product domain, neglect the micro-video domain, and often ignore the multimodal characteristics of products. Moreover, these datasets constrain their scale by requiring content alignment, and their content-based organization cannot capture user retrieval intentions. To address these limitations, we propose the PKU Real20M dataset, a large-scale e-commerce dataset designed for cross-domain retrieval. We adopt a query-driven approach to efficiently gather over 20 million e-commerce products and micro-videos together with their multimodal information. In addition, we design a three-level entity prompt learning framework that aligns inter-modality information from coarse to fine. We further introduce the Query-driven Cross-Domain retrieval framework (QCD), which leverages user queries to efficiently align the product and micro-video domains. Extensive experiments on two downstream tasks validate the effectiveness of our proposed approaches. The dataset and source code are available at https://github.com/PKU-ICST-MIPL/Real20M_ACMMM2023.
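To make the query-driven alignment idea concrete, below is a minimal sketch (not the authors' released code) of how user queries can serve as a shared anchor between the two domains: product and micro-video embeddings are each pulled toward the embedding of the query that retrieved them via an in-batch contrastive loss, so the two domains become comparable without direct product-video pair labels. All module names, feature dimensions, and the loss weighting are illustrative assumptions.

```python
# Sketch of query-anchored cross-domain alignment (assumed formulation).
import torch
import torch.nn as nn
import torch.nn.functional as F


class QueryAnchoredAlignment(nn.Module):
    def __init__(self, embed_dim: int = 512, temperature: float = 0.07):
        super().__init__()
        # Hypothetical projection heads; a real system would place them on top of
        # pretrained text / image / video encoders (e.g., a CLIP-style backbone).
        self.query_proj = nn.Linear(embed_dim, embed_dim)
        self.product_proj = nn.Linear(embed_dim, embed_dim)
        self.video_proj = nn.Linear(embed_dim, embed_dim)
        self.temperature = temperature

    @staticmethod
    def info_nce(anchor: torch.Tensor, positives: torch.Tensor, temperature: float) -> torch.Tensor:
        # In-batch contrastive loss: the i-th positive matches the i-th anchor.
        logits = anchor @ positives.t() / temperature
        targets = torch.arange(anchor.size(0), device=anchor.device)
        return F.cross_entropy(logits, targets)

    def forward(self, query_feat, product_feat, video_feat):
        # L2-normalize projected features so dot products are cosine similarities.
        q = F.normalize(self.query_proj(query_feat), dim=-1)
        p = F.normalize(self.product_proj(product_feat), dim=-1)
        v = F.normalize(self.video_proj(video_feat), dim=-1)
        # The query acts as the bridge: align products to queries and videos to
        # queries, rather than requiring annotated product-video pairs.
        loss_qp = self.info_nce(q, p, self.temperature)
        loss_qv = self.info_nce(q, v, self.temperature)
        return 0.5 * (loss_qp + loss_qv)


if __name__ == "__main__":
    model = QueryAnchoredAlignment(embed_dim=512)
    batch = 8
    q = torch.randn(batch, 512)  # query text features (hypothetical text encoder output)
    p = torch.randn(batch, 512)  # product image + title features
    v = torch.randn(batch, 512)  # micro-video frame features
    loss = model(q, p, v)
    loss.backward()
    print(f"alignment loss: {loss.item():.4f}")
```

The symmetric treatment of the two query-to-domain losses is one plausible design choice; the actual QCD framework may weight or structure these terms differently.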
