Personalized Showcases: Generating Multi-Modal Explanations for Recommendations

Existing explanation models for recommendation generate only text and still struggle to produce diverse content. In this paper, to further enrich explanations, we propose a new task named personalized showcases, in which we provide both textual and visual information to explain our recommendations. Specifically, we first select a personalized image set that is most relevant to a user's interests in a recommended item. Then, natural language explanations are generated conditioned on the selected images. For this new task, we collect a large-scale dataset from Google Maps and construct a high-quality subset for generating multi-modal explanations. We propose a personalized multi-modal framework that can generate diverse and visually aligned explanations via contrastive learning. Experiments show that our framework benefits from different input modalities and produces more diverse and expressive explanations than previous methods on a variety of evaluation metrics.
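The abstract does not specify the exact contrastive objective; as a hedged illustration only, a standard in-batch InfoNCE-style loss (the form popularized by SimCLR and CLIP) that pulls each generated explanation's embedding toward its paired image-set embedding, while pushing it away from the other pairs in the batch, could be sketched as follows. All names here are hypothetical, not the paper's implementation:

```python
import numpy as np

def info_nce_loss(text_emb, image_emb, temperature=0.07):
    """InfoNCE-style contrastive loss over a batch of paired
    (text, image-set) embeddings. Row i of each matrix is a
    positive pair; all other rows act as in-batch negatives.
    Hypothetical sketch -- not the paper's exact objective."""
    # L2-normalize rows so dot products are cosine similarities.
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    logits = (t @ v.T) / temperature          # (B, B) similarity matrix
    # Cross-entropy against the matching pair on the diagonal.
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))
```

A quick sanity check of the design: when text and image embeddings are perfectly aligned the diagonal dominates and the loss is small, whereas mismatched pairs yield a larger loss, which is the gradient signal that encourages visually aligned generations.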
