Learning Cross-Modal Embeddings With Adversarial Networks for Cooking Recipes and Food Images

Food computing plays an increasingly important role in daily life and has found many applications in guiding human behavior towards smart food consumption and a healthy lifestyle. An important task under the food-computing umbrella is retrieval, which is particularly helpful for health-related applications that need to retrieve important information about food (e.g., ingredients and nutrition). In this paper, we investigate the open research task of cross-modal retrieval between cooking recipes and food images, and propose a novel framework, Adversarial Cross-Modal Embedding (ACME), to address it. Specifically, the goal is to learn a common embedding space shared by the two modalities. Our approach combines several novel ideas: (i) learning with a new triplet loss scheme together with an effective sampling strategy, (ii) imposing modality alignment using an adversarial learning strategy, and (iii) imposing cross-modal translation consistency, such that the embedding of one modality is able to recover some important information about the corresponding instance in the other modality. ACME achieves state-of-the-art performance on the benchmark Recipe1M dataset, validating the efficacy of the proposed technique.
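As a rough illustration of idea (i), the sketch below computes a bidirectional triplet loss over a batch of paired image and recipe embeddings, choosing the hardest in-batch negative for each anchor. The abstract does not specify the sampling strategy, the distance, or the margin, so hardest-negative mining, cosine distance, the margin value of 0.3, and the function names are all assumptions made for illustration, not the authors' implementation.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cosine_distance(a, b):
    """Cosine distance in [0, 2]; 0 means the vectors point the same way."""
    a, b = l2_normalize(a), l2_normalize(b)
    return 1.0 - sum(x * y for x, y in zip(a, b))


def hard_triplet_loss(image_emb, recipe_emb, margin=0.3):
    """Bidirectional triplet loss with hardest in-batch negatives.

    image_emb[i] and recipe_emb[i] are assumed to describe the same dish.
    The batch must contain at least two pairs so that negatives exist.
    The margin value is a hypothetical default, not taken from the paper.
    """
    n = len(image_emb)
    total = 0.0
    for i in range(n):
        pos = cosine_distance(image_emb[i], recipe_emb[i])
        # Image as anchor: the closest non-matching recipe is the hardest negative.
        neg_recipe = min(
            cosine_distance(image_emb[i], recipe_emb[j]) for j in range(n) if j != i
        )
        # Recipe as anchor: the closest non-matching image is the hardest negative.
        neg_image = min(
            cosine_distance(recipe_emb[i], image_emb[j]) for j in range(n) if j != i
        )
        total += max(0.0, pos - neg_recipe + margin)
        total += max(0.0, pos - neg_image + margin)
    return total / n


# Toy batch: two dishes embedded in a 2-D space.
images = [[1.0, 0.0], [0.0, 1.0]]
aligned_recipes = [[1.0, 0.0], [0.0, 1.0]]
swapped_recipes = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = hard_triplet_loss(images, aligned_recipes)
loss_swapped = hard_triplet_loss(images, swapped_recipes)
```

In this toy batch the loss is zero when each image sits on top of its own recipe and becomes large when the pairs are swapped, which is the behavior a retrieval-oriented embedding objective is meant to reward.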
