DATE: Domain Adaptive Product Seeker for E-Commerce

Product Retrieval (PR) and Product Grounding (PG), which aim to seek image-level and object-level products, respectively, according to a textual query, have recently attracted great interest for improving the shopping experience. Owing to the lack of relevant datasets, we collect two large-scale benchmark datasets from the Taobao Mall and Taobao Live domains, with about 474k and 101k image-query pairs for PR, and manually annotate the object bounding boxes in each image for PG. Since annotating boxes is expensive and time-consuming, we attempt to transfer knowledge from the annotated domain to the unannotated one for PG, i.e., unsupervised Domain Adaptation (PG-DA). We propose a {\bf D}omain {\bf A}daptive Produc{\bf t} S{\bf e}eker ({\bf DATE}) framework, which regards PR and PG as product seeking at different levels, to help the query {\bf date} the product. Concretely, we first design a semantics-aggregated feature extractor for each modality to obtain concentrated yet comprehensive features for the subsequent efficient retrieval and fine-grained grounding tasks. Then, we present two cooperative seekers that simultaneously search for the image in PR and localize the product in PG. Besides, we devise a domain aligner for PG-DA to alleviate uni-modal marginal and multi-modal conditional distribution shifts between the source and target domains, and design a pseudo box generator that dynamically selects reliable instances and generates bounding boxes for further knowledge transfer. Extensive experiments show that DATE achieves satisfactory performance on fully-supervised PR and PG as well as unsupervised PG-DA. Our desensitized datasets will be publicly available here\footnote{\url{https://github.com/Taobao-live/Product-Seeking}}.
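Two of the abstract's components lend themselves to a compact illustration: image-level seeking, which ranks gallery images against the query embedding, and pseudo box generation, which keeps only confident target-domain predictions as pseudo labels. The following is a minimal sketch of these two ideas; the function names, the cosine-similarity scoring, and the fixed confidence threshold are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Normalize vectors to unit length so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + 1e-8)

def retrieve(query_vec, image_vecs, top_k=5):
    """Image-level seeking (PR): rank gallery images by cosine similarity
    to the textual query embedding and return the top-k indices and scores."""
    sims = l2_normalize(image_vecs) @ l2_normalize(query_vec)
    order = np.argsort(-sims)  # descending similarity
    return order[:top_k], sims[order[:top_k]]

def select_pseudo_boxes(boxes, confidences, threshold=0.9):
    """Pseudo box generation (PG-DA): keep only target-domain predictions
    whose confidence clears a threshold, so unreliable boxes are not
    used as pseudo labels for knowledge transfer."""
    keep = confidences >= threshold
    return boxes[keep], confidences[keep]
```

In a full system the embeddings would come from the semantics-aggregated extractors and the confidences from the grounding seeker; here random vectors suffice to exercise the logic.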
