Cross-Modal Retrieval in the Cooking Context: Learning Semantic Text-Image Embeddings

Designing powerful tools that support cooking activities has rapidly gained popularity, owing to the massive amounts of available data and to recent advances in machine learning capable of analyzing that data. In this paper, we propose a cross-modal retrieval model that aligns visual and textual data (such as pictures of dishes and their recipes) in a shared representation space. We describe an effective learning scheme, capable of tackling large-scale problems, and validate it on the Recipe1M dataset, which contains nearly 1 million picture-recipe pairs. We show the effectiveness of our approach compared with previous state-of-the-art models and present qualitative results on computational cooking use cases.
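To make the idea of a shared representation space concrete, the following minimal sketch ranks recipes for a query image by cosine similarity between embedding vectors. It is an illustration under stated assumptions, not the model described in this paper: the three-dimensional toy vectors, the choice of cosine similarity, and the helper names `l2_normalize`, `cosine`, and `rank_recipes` are all hypothetical, and in practice the embeddings would come from learned image and text encoders.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    u, v = l2_normalize(u), l2_normalize(v)
    return sum(a * b for a, b in zip(u, v))


def rank_recipes(image_vec, recipe_vecs):
    """Return recipe indices sorted from most to least similar to the image."""
    scores = [cosine(image_vec, r) for r in recipe_vecs]
    return sorted(range(len(recipe_vecs)), key=lambda i: scores[i], reverse=True)


# Toy embeddings (assumed values, for illustration only).
image_vec = [0.9, 0.1, 0.0]
recipe_vecs = [
    [0.0, 1.0, 0.0],  # recipe 0
    [1.0, 0.2, 0.0],  # recipe 1: closest in direction to the image
    [0.0, 0.0, 1.0],  # recipe 2
]

ranking = rank_recipes(image_vec, recipe_vecs)
print(ranking)
```

Because both modalities live in the same space, the same ranking function could be used in the other direction, taking a recipe embedding as the query and ranking image embeddings.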
