Paper information - Recipe1M+: A Dataset for Learning Cross-Modal Embeddings for Cooking Recipes and Food Images

In this paper, we introduce Recipe1M+, a new large-scale, structured corpus of over one million cooking recipes and 13 million food images. As the largest publicly available collection of recipe data, Recipe1M+ affords the ability to train high-capacity models on aligned, multimodal data. Using these data, we train a neural network to learn a joint embedding of recipes and images that yields impressive results on an image-recipe retrieval task. Moreover, we demonstrate that regularization via the addition of a high-level classification objective both improves retrieval performance to rival that of humans and enables semantic vector arithmetic. We postulate that these embeddings will provide a basis for further exploration of the Recipe1M+ dataset and food and cooking in general. Code, data and models are publicly available at http://im2recipe.csail.mit.edu.
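To make the two uses of a joint embedding mentioned in the abstract concrete (image-to-recipe retrieval and semantic vector arithmetic), here is a minimal sketch. It is not the authors' model: the 3-dimensional vectors are hand-made toy values, the dish names are illustrative, and the helper names (`cosine`, `retrieve`, `analogy`) are hypothetical. In the paper the embeddings are learned by a neural network; this only shows how such embeddings could be queried once they exist.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query, candidates):
    """Rank candidate names by cosine similarity to the query, best first."""
    return sorted(candidates,
                  key=lambda name: cosine(query, candidates[name]),
                  reverse=True)

def analogy(a, b, c):
    """Vector arithmetic a - b + c, computed element-wise."""
    return [x - y + z for x, y, z in zip(a, b, c)]

# Toy recipe embeddings (assumed values, not learned).
# The axes loosely stand for: pizza-like, contains chicken, salad-like.
recipes = {
    "chicken pizza": [1.0, 1.0, 0.0],
    "pizza":         [1.0, 0.0, 0.0],
    "chicken salad": [0.0, 1.0, 1.0],
    "salad":         [0.0, 0.0, 1.0],
}

# A toy image embedding assumed to live in the same space as the recipes.
image_embedding = [0.9, 0.8, 0.1]

# Image-to-recipe retrieval: nearest recipes to the image embedding.
ranking = retrieve(image_embedding, recipes)
print(ranking[0])

# Semantic vector arithmetic, then retrieval with the resulting vector.
query = analogy(recipes["chicken pizza"], recipes["pizza"], recipes["salad"])
print(retrieve(query, recipes)[0])
```

Cosine similarity is used here only because it is a common choice for comparing embeddings; the abstract does not state which similarity measure the retrieval task relies on.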
