DATE: Domain Adaptive Product Seeker for E-Commerce

Product Retrieval (PR) and Product Grounding (PG), which aim to seek image-level and object-level products, respectively, according to a textual query, have recently attracted great interest for improving the shopping experience. Owing to the lack of relevant datasets, we collect two large-scale benchmark datasets from the Taobao Mall and Taobao Live domains, with about 474k and 101k image-query pairs for PR, and manually annotate the object bounding boxes in each image for PG. Since annotating boxes is expensive and time-consuming, we attempt to transfer knowledge from the annotated domain to the unannotated one for PG, i.e., unsupervised Domain Adaptation (PG-DA). We propose a {\bf D}omain {\bf A}daptive Produc{\bf t} S{\bf e}eker ({\bf DATE}) framework, which regards PR and PG as product seeking at different levels, to help the query {\bf date} the product. Concretely, we first design a semantics-aggregated feature extractor for each modality to obtain concentrated yet comprehensive features for the subsequent efficient retrieval and fine-grained grounding tasks. Then, we present two cooperative seekers that simultaneously search for the image in PR and localize the product in PG. Besides, we devise a domain aligner for PG-DA to alleviate uni-modal marginal and multi-modal conditional distribution shifts between the source and target domains, and design a pseudo box generator that dynamically selects reliable instances and generates bounding boxes for further knowledge transfer. Extensive experiments show that DATE achieves satisfactory performance on fully-supervised PR and PG as well as unsupervised PG-DA. Our desensitized datasets will be publicly available here\footnote{\url{https://github.com/Taobao-live/Product-Seeking}}.
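Two of the abstract's components lend themselves to a compact illustration: image-level seeking, which ranks gallery images against the query embedding, and pseudo box generation, which keeps only confident target-domain predictions as pseudo labels. The following is a minimal sketch of these two ideas; the function names, the cosine-similarity scoring, and the fixed confidence threshold are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Normalize vectors to unit length so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + 1e-8)

def retrieve(query_vec, image_vecs, top_k=5):
    """Image-level seeking (PR): rank gallery images by cosine similarity
    to the textual query embedding and return the top-k indices and scores."""
    sims = l2_normalize(image_vecs) @ l2_normalize(query_vec)
    order = np.argsort(-sims)  # descending similarity
    return order[:top_k], sims[order[:top_k]]

def select_pseudo_boxes(boxes, confidences, threshold=0.9):
    """Pseudo box generation (PG-DA): keep only target-domain predictions
    whose confidence clears a threshold, so unreliable boxes are not
    used as pseudo labels for knowledge transfer."""
    keep = confidences >= threshold
    return boxes[keep], confidences[keep]
```

In a full system the embeddings would come from the semantics-aggregated extractors and the confidences from the grounding seeker; here random vectors suffice to exercise the logic.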
