Paper information - Learning Cross-Modal Embeddings for Cooking Recipes and Food Images

In this paper, we introduce Recipe1M, a new large-scale, structured corpus of over 1m cooking recipes and 800k food images. As the largest publicly available collection of recipe data, Recipe1M affords the ability to train high-capacity models on aligned, multi-modal data. Using these data, we train a neural network to find a joint embedding of recipes and images that yields impressive results on an image-recipe retrieval task. Additionally, we demonstrate that regularization via the addition of a high-level classification objective both improves retrieval performance to rival that of humans and enables semantic vector arithmetic. We postulate that these embeddings will provide a basis for further exploration of the Recipe1M dataset and food and cooking in general. Code, data and models are publicly available.
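The abstract describes three ideas that a small sketch may make more concrete: retrieval in a joint image-recipe embedding space, a classification objective added as a regularizer, and vector arithmetic on the embeddings. The code below is only an illustration under stated assumptions, not the authors' implementation. Cosine similarity as the matching score, the pairwise form of the alignment loss, and every name and default value (`rank_recipes`, `joint_loss`, `analogy`, `margin`, `reg_weight`) are hypothetical choices made for this example. The vectors stand in for the outputs of trained image and recipe encoders, which are not shown.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_recipes(image_vec, recipe_vecs):
    """Return recipe indices ordered from most to least similar to the image."""
    scores = [cosine(image_vec, r) for r in recipe_vecs]
    return sorted(range(len(recipe_vecs)), key=lambda i: scores[i], reverse=True)

def cross_entropy(logits, label):
    """Softmax cross-entropy for a single example, computed stably."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[label]

def joint_loss(image_vec, recipe_vec, is_match, img_logits, rec_logits, label,
               margin=0.1, reg_weight=0.02):
    """Alignment loss on one image-recipe pair plus a weighted
    classification term on both modalities.

    margin and reg_weight are illustrative assumptions, not values
    taken from the abstract.
    """
    sim = cosine(image_vec, recipe_vec)
    # Pull matching pairs together; push non-matching pairs below the margin.
    align = 1.0 - sim if is_match else max(0.0, sim - margin)
    # High-level classification objective acting as a regularizer.
    reg = cross_entropy(img_logits, label) + cross_entropy(rec_logits, label)
    return align + reg_weight * reg

def analogy(a, b, c):
    """Embedding arithmetic a - b + c, computed element-wise."""
    return [x - y + z for x, y, z in zip(a, b, c)]

# Toy usage with 2-dimensional embeddings.
image = [1.0, 0.0]
recipes = [[0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
print(rank_recipes(image, recipes))  # [1, 0, 2]
```

In this toy setup the second recipe vector points almost in the same direction as the image vector, so it is ranked first. The result of `analogy` could be passed to `rank_recipes` as a query in the same way, which is one plausible reading of the semantic vector arithmetic mentioned in the abstract.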
