Extending CLIP for Category-to-image Retrieval in E-commerce

E-commerce provides rich multimodal data that is barely leveraged in practice. One aspect of this data is a category tree that is used in search and recommendation. However, in practice, during a user's session there is often a mismatch between the textual and the visual representation of a given category. Motivated by this problem, we introduce the task of category-to-image retrieval in e-commerce and propose a model for the task, CLIP-ITA. The model leverages information from multiple modalities (textual, visual, and attribute) to create product representations. We explore how adding information from each of these modalities impacts the model's performance. In particular, we observe that CLIP-ITA significantly outperforms a comparable model that leverages only the visual modality and a comparable model that leverages the visual and attribute modalities.
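The retrieval setup described above can be sketched in a few lines: encode the category as a query vector, fuse each product's text, image, and attribute embeddings into a single representation, and rank products by cosine similarity. This is a minimal, hypothetical illustration with toy 4-d vectors standing in for real CLIP encoder outputs; the averaging fusion and the function names here are assumptions for illustration, not CLIP-ITA's actual architecture:

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Scale vectors to unit length so dot products equal cosine similarity."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def fuse_product(text_emb, image_emb, attr_embs):
    """Combine one product's three modality embeddings.

    Simple averaging is a placeholder fusion step; the paper's model
    may fuse modalities differently.
    """
    attr_emb = attr_embs.mean(axis=0)  # pool multiple attribute embeddings
    return l2_normalize((text_emb + image_emb + attr_emb) / 3.0)

def rank_products(category_emb, product_embs):
    """Rank products by cosine similarity to the category embedding."""
    sims = product_embs @ l2_normalize(category_emb)
    return np.argsort(-sims)  # indices, most similar first

# Toy 4-d embeddings standing in for encoder outputs.
category = np.array([1.0, 0.0, 0.0, 0.0])  # e.g. the category "sneakers"

p0 = fuse_product(np.array([0.9, 0.1, 0.0, 0.0]),    # text embedding
                  np.array([1.0, 0.0, 0.0, 0.0]),    # image embedding
                  np.array([[0.8, 0.2, 0.0, 0.0]]))  # attribute embeddings
p1 = fuse_product(np.array([0.0, 1.0, 0.0, 0.0]),
                  np.array([0.0, 0.9, 0.1, 0.0]),
                  np.array([[0.0, 1.0, 0.0, 0.0]]))

order = rank_products(category, np.stack([p0, p1]))
print(order)  # product 0, which aligns with the category, ranks first
```

In a real pipeline the toy vectors would be replaced by outputs of CLIP's text and image encoders plus an attribute encoder, but the ranking step stays the same: normalized dot products against the category query.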
