Vision-Language Recommendation via Attribute Augmented Multimodal Reinforcement Learning

Interactive recommenders have demonstrated the advantage over traditional recommenders with dynamic change of items. However, the traditional user feedback in the format of clicks or ratings, provides limited user preference information and limited history tracking capabilities. As a result, it takes a user many interactions to find a desired item. Data of other modalities, such as item visual appearance and user comments in natural language, may enable richer user feedback. However, there are several critical challenges to be addressed when utilizing these multimodal data: multimodal matching, user preference tracking, and adaptation to dynamic unseen items. Without properly handling these challenges, the recommendations can easily violate the users' preference from their past natural language feedback. In this paper, we introduce a novel approach, called vision-language recommendation, that enables users to provide natural language feedback on visual products to have more natural and effective interactions. To model more explicit and accurate multimodal matching, we propose a novel visual attribute augmented reinforcement learning approach that enhances the grounding of natural language to visual items. Furthermore, to effectively track the users' preference and overcome the performance deficiency on dynamic unseen items after deployment, we propose a novel history multimodal matching reward to continuously adapt the model on-the-fly. Empirical results show that, our system augmented by visual attribute and history multimodal matching can significantly increase the success rate, reduce the number of recommendations that violate the user's previous feedback, and need less number of user interactions to find the desired items.

[1]  Kristen Grauman,et al.  Fine-Grained Comparisons with Attributes , 2017 .

[2]  Filip Radlinski,et al.  A Theoretical Framework for Conversational Search , 2017, CHIIR.

[3]  Giovanni Semeraro,et al.  Converse-Et-Impera: Exploiting Deep Learning and Hierarchical Reinforcement Learning for Conversational Recommender Systems , 2017, AI*IA.

[4]  Lihong Li,et al.  An Empirical Evaluation of Thompson Sampling , 2011, NIPS.

[5]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[6]  Bohyung Han,et al.  Visual Reference Resolution using Attention Memory for Visual Dialog , 2017, NIPS.

[7]  Stefanie Tellex,et al.  Towards surveillance video search by natural language query , 2009, CIVR '09.

[8]  Filip Radlinski,et al.  Towards Conversational Recommender Systems , 2016, KDD.

[9]  Sanja Fidler,et al.  MovieQA: Understanding Stories in Movies through Question-Answering , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[10]  Ross B. Girshick,et al.  Mask R-CNN , 2017, 1703.06870.

[11]  Trevor Darrell,et al.  Natural Language Object Retrieval , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[12]  Kristen Grauman,et al.  Fine-Grained Visual Comparisons with Local Learning , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[13]  Yi Zhang,et al.  Conversational Recommender System , 2018, SIGIR.

[14]  Wei Chu,et al.  A contextual-bandit approach to personalized news article recommendation , 2010, WWW '10.

[15]  Nicholas J. Belkin,et al.  Cases, scripts, and information-seeking strategies: On the design of interactive information retrieval systems , 1995 .

[16]  Bart Thomee,et al.  Interactive search in image retrieval: a survey , 2012, International Journal of Multimedia Information Retrieval.

[17]  Jeffrey Mark Siskind,et al.  Saying What You're Looking For: Linguistics Meets Video Search , 2013, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[18]  Adriana Kovashka,et al.  Attribute Pivots for Guiding Relevance Feedback in Image Search , 2013, 2013 IEEE International Conference on Computer Vision.

[19]  Yu Qiao,et al.  Joint Face Detection and Alignment Using Multitask Cascaded Convolutional Networks , 2016, IEEE Signal Processing Letters.

[20]  Carlos D. Castillo,et al.  An All-In-One Convolutional Neural Network for Face Analysis , 2016, 2017 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017).

[21]  Yoshua Bengio,et al.  Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation , 2014, EMNLP.

[22]  Alexander C. Berg,et al.  Automatic Attribute Discovery and Characterization from Noisy Web Data , 2010, ECCV.

[23]  Xiaogang Wang,et al.  Person Search with Natural Language Description , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[24]  Thomas S. Huang,et al.  Relevance feedback: a power tool for interactive content-based image retrieval , 1998, IEEE Trans. Circuits Syst. Video Technol..

[25]  Adriana Kovashka,et al.  WhittleSearch: Image search with relative attribute feedback , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[26]  Trevor Darrell,et al.  Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation , 2013, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[27]  Martial Hebert,et al.  Cross-Stitch Networks for Multi-task Learning , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[28]  Adriana Kovashka,et al.  Attributes for Image Retrieval , 2017 .

[29]  Kristen Grauman,et al.  Relative attributes , 2011, 2011 International Conference on Computer Vision.

[30]  Hugo Larochelle,et al.  GuessWhat?! Visual Object Discovery through Multi-modal Dialogue , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[31]  Kristen Grauman,et al.  Semantic Jitter: Dense Supervision for Visual Comparisons via Synthetic Images , 2016, 2017 IEEE International Conference on Computer Vision (ICCV).

[32]  Dragutin Petkovic,et al.  Query by Image and Video Content: The QBIC System , 1995, Computer.

[33]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[34]  Ming-Syan Chen,et al.  UbiShop: Commercial item recommendation using visual part-based object representation , 2016, Multimedia Tools and Applications.

[35]  Dwi H. Widyantoro,et al.  A framework of conversational recommender system based on user functional requirements , 2014, 2014 2nd International Conference on Information and Communication Technology (ICoICT).

[36]  M. de Rijke,et al.  Attentive Memory Networks: Efficient Machine Reading for Conversational Search , 2017, ArXiv.

[37]  Vaibhava Goel,et al.  Self-Critical Sequence Training for Image Captioning , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[38]  Samy Bengio,et al.  Show and tell: A neural image caption generator , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[39]  Barry Smyth,et al.  Compound Critiques for Conversational Recommender Systems , 2004, IEEE/WIC/ACM International Conference on Web Intelligence (WI'04).

[40]  Stefan Lee,et al.  Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[41]  Margaret Mitchell,et al.  VQA: Visual Question Answering , 2015, International Journal of Computer Vision.

[42]  Olivier Pietquin,et al.  End-to-end optimization of goal-driven and visually grounded dialogue systems , 2017, IJCAI.

[43]  Andrew Zisserman,et al.  Learning Visual Attributes , 2007, NIPS.

[44]  Yishay Mansour,et al.  Policy Gradient Methods for Reinforcement Learning with Function Approximation , 1999, NIPS.

[45]  Yoon Kim,et al.  Convolutional Neural Networks for Sentence Classification , 2014, EMNLP.

[46]  Gang Chen,et al.  iGlasses: A Novel Recommendation System for Best-fit Glasses , 2016, SIGIR.

[47]  Zheng Wen,et al.  Cascading Bandits: Learning to Rank in the Cascade Model , 2015, ICML.

[48]  Shih-Fu Chang,et al.  Deep Cross Residual Learning for Multitask Visual Recognition , 2016, ACM Multimedia.

[49]  Shih-Fu Chang,et al.  Image Retrieval: Current Techniques, Promising Directions, and Open Issues , 1999, J. Vis. Commun. Image Represent..

[50]  Rogério Schmidt Feris,et al.  Dialog-based Interactive Image Retrieval , 2018, NeurIPS.

[51]  Hanqing Lu,et al.  WillHunter: interactive image retrieval with multilevel relevance , 2004, Proceedings of the 17th International Conference on Pattern Recognition, 2004. ICPR 2004..

[52]  Christopher Joseph Pal,et al.  Towards Deep Conversational Recommendations , 2018, NeurIPS.