Multi-subspace Implicit Alignment for Cross-modal Retrieval on Cooking Recipes and Food Images

Cross-modal retrieval technology can help people quickly find information across cooking recipes and food images. The embeddings of both images and recipes consist of multiple representation subspaces, and we argue that multiple aspects of a recipe correspond to multiple regions of the food image. Making full use of the implicit connections between these subspaces to improve cross-modal retrieval quality is challenging. In this paper, we propose a multi-subspace implicit alignment framework for cross-modal retrieval of recipes and images. Our framework learns multi-subspace information about cooking recipes and food images with multi-head attention networks; implicit alignment at the subspace level narrows the semantic gap between recipe embeddings and food image embeddings; and triplet loss is combined with adversarial loss to guide cross-modal learning. Experimental results show that our framework significantly outperforms state-of-the-art methods in terms of MedR and R@K on Recipe1M.
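The subspace-level matching idea above can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's implementation: it assumes recipe and image embeddings of equal dimension, splits each into fixed, equal-size subspaces (the paper instead learns them with multi-head attention), scores a pair by the mean per-subspace cosine similarity, and applies a standard hinge-style triplet loss; the function names, the number of subspaces, and the margin value are all hypothetical choices for the sketch.

```python
import numpy as np

def subspace_similarity(recipe_emb, image_emb, num_subspaces=4):
    """Split each embedding into equal-size subspaces and average the
    per-subspace cosine similarities -- a simplified stand-in for the
    learned implicit subspace alignment described in the abstract."""
    r = recipe_emb.reshape(num_subspaces, -1)
    v = image_emb.reshape(num_subspaces, -1)
    sims = np.sum(r * v, axis=1) / (
        np.linalg.norm(r, axis=1) * np.linalg.norm(v, axis=1) + 1e-8)
    return float(sims.mean())

def triplet_loss(anchor, positive, negative, margin=0.3, num_subspaces=4):
    """Hinge triplet loss over subspace similarities: the paired image
    should outscore a mismatched one by at least `margin`."""
    pos = subspace_similarity(anchor, positive, num_subspaces)
    neg = subspace_similarity(anchor, negative, num_subspaces)
    return max(0.0, margin - pos + neg)
```

In the full framework this loss would be combined with an adversarial loss from a modality discriminator, and the subspace split would come from attention heads rather than a fixed reshape.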
