Latent Structures Mining with Contrastive Modality Fusion for Multimedia Recommendation

Multimedia content is predominant in the modern Web era. Recent years have witnessed growing research interest in multimedia recommendation, which aims to predict whether a user will interact with an item that has multimodal content. Most previous studies focus on modeling user-item interactions, with multimodal features included as side information. However, this scheme is not well suited to multimedia recommendation. First, item-item relationships are modeled only implicitly and only collaboratively, through high-order item-user-item co-occurrences. Considering that items are associated with rich content in multiple modalities, we argue that the latent semantic item-item structures underlying this multimodal content could help learn better item representations and help recommender models discover candidate items more comprehensively. Second, previous studies disregard fine-grained multimodal fusion. Although access to multiple modalities might allow us to capture rich information, we argue that the simple coarse-grained fusion by linear combination or concatenation used in previous work is insufficient to fully capture the content information of items and the relationships among items. To this end, we propose a latent structure MIning with ContRastive mOdality fusion method, which we term MICRO for brevity. Specifically, in the proposed MICRO model, we devise a novel modality-aware structure learning module that learns item-item relationships for each modality. Based on the learned modality-aware latent item relationships, we perform graph convolutions that explicitly inject item affinities into the modality-aware item representations. Additionally, we design a novel multimodal contrastive framework that facilitates fine-grained multimodal fusion by forcing the modality-aware representations and the multimodal fused representation to be close. Finally, these enriched item representations can be plugged into existing collaborative filtering methods to make more accurate recommendations. Extensive experiments on three real-world datasets demonstrate the superiority of our method over state-of-the-art multimedia recommendation methods, and ablation studies validate the efficacy of mining latent item-item relationships and of the contrastive multimodal fusion framework.
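
The pipeline outlined above (a per-modality item-item graph, graph convolution over it, and a contrastive term that pulls each modality-aware representation toward the fused one) can be illustrated with a minimal sketch. This is not the MICRO implementation. The kNN graph with uniform edge weights, the weight-free propagation step, the mean fusion, the InfoNCE-style loss, and all function names and hyperparameters (`knn_graph`, `propagate`, `fuse`, `info_nce`, `k`, `tau`) are assumptions chosen only to make the idea concrete. The random feature vectors stand in for real modality features.

```python
import math
import random

def cosine(u, v):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / (nu * nv + 1e-12)

def knn_graph(feats, k):
    """Assumed structure learner: keep each item's k most similar items
    (self excluded) and give the kept edges uniform weight 1/k."""
    n = len(feats)
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        sims = [(cosine(feats[i], feats[j]), j) for j in range(n) if j != i]
        top = sorted(sims, reverse=True)[:k]
        for _, j in top:
            adj[i][j] = 1.0 / len(top)
    return adj

def propagate(adj, emb):
    """One graph-convolution step without weights or nonlinearity: adj @ emb."""
    n, d = len(emb), len(emb[0])
    return [[sum(adj[i][j] * emb[j][c] for j in range(n)) for c in range(d)]
            for i in range(n)]

def fuse(per_modality):
    """Assumed fusion rule: average the modality-aware embeddings per item."""
    m = len(per_modality)
    n, d = len(per_modality[0]), len(per_modality[0][0])
    return [[sum(h[i][c] for h in per_modality) / m for c in range(d)]
            for i in range(n)]

def info_nce(view, fused, tau=0.5):
    """InfoNCE-style loss: item i's modality-aware embedding should match
    its own fused embedding better than any other item's fused embedding."""
    n = len(view)
    total = 0.0
    for i in range(n):
        logits = [cosine(view[i], fused[j]) / tau for j in range(n)]
        total += math.log(sum(math.exp(x) for x in logits)) - logits[i]
    return total / n

# Toy data: two hypothetical modalities, already in a shared dimension.
random.seed(0)
n_items, dim = 6, 4
modalities = {
    name: [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_items)]
    for name in ("visual", "textual")
}

graphs = {m: knn_graph(f, k=2) for m, f in modalities.items()}
hidden = {m: propagate(graphs[m], modalities[m]) for m in modalities}
fused = fuse(list(hidden.values()))
loss = sum(info_nce(hidden[m], fused) for m in hidden)
```

In a trained model the contrastive term would be minimized jointly with a recommendation objective, and `fused` would serve as the enriched item representation passed to a collaborative filtering backbone. Here the loss is only evaluated once to show how the pieces connect.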
