Ingredient-Guided Cascaded Multi-Attention Network for Food Recognition

Recently, food recognition is gaining more attention in the multimedia community due to its various applications, e.g., multimodal foodlog and personalized healthcare. Most of existing methods directly extract visual features of the whole image using popular deep networks for food recognition without considering its own characteristics. Compared with other types of object images, food images generally do not exhibit distinctive spatial arrangement and common semantic patterns, and thus are very hard to capture discriminative information. In this work, we achieve food recognition by developing an Ingredient-Guided Cascaded Multi-Attention Network (IG-CMAN), which is capable of sequentially localizing multiple informative image regions with multi-scale from category-level to ingredient-level guidance in a coarse-to-fine manner. At the first level, IG-CMAN generates the initial attentional region from the category-supervised network with Spatial Transformer (ST). Taking this localized attentional region as the reference, IG-CMAN combined ST with LSTM to sequentially discover diverse attentional regions with fine-grained scales from ingredient-guided sub-network in the following levels. Furthermore, we introduce a new dataset ISIA Food-200 with 200 food categories from the list in the Wikipedia, about 200,000 food images and 319 ingredients. We conducted extensive experiment on two popular food datasets and newly proposed ISIA Food-200, and verified the effectiveness of our method. Qualitative results along with visualization further show that IG-CMAN can introduce the explainability for localized regions, and is able to learn relevant regions for ingredients.

[1]  Luis Herranz,et al.  Being a Supercook: Joint Food Attributes and Multimodal Content Modeling for Recipe Retrieval and Exploration , 2017, IEEE Transactions on Multimedia.

[2]  Keiji Yanai,et al.  Food image recognition using deep convolutional network with pre-training and fine-tuning , 2015, 2015 IEEE International Conference on Multimedia & Expo Workshops (ICMEW).

[3]  Shuqiang Jiang,et al.  Multi-Scale Multi-View Deep Feature Aggregation for Food Recognition , 2020, IEEE Transactions on Image Processing.

[4]  Tao Mei,et al.  Learning Multi-attention Convolutional Neural Network for Fine-Grained Image Recognition , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[5]  Gianluca Stringhini,et al.  Kissing Cuisines: Exploring Worldwide Culinary Habits on the Web , 2016, WWW.

[6]  Xiangyang Li,et al.  Where and What to Eat: Simultaneous Restaurant and Dish Recognition from Food Image , 2016, PCM.

[7]  Shuang Wang,et al.  Geolocalized Modeling for Dish Recognition , 2015, IEEE Transactions on Multimedia.

[8]  Andrew Zisserman,et al.  Spatial Transformer Networks , 2015, NIPS.

[9]  Matthieu Guillaumin,et al.  Food-101 - Mining Discriminative Components with Random Forests , 2014, ECCV.

[10]  Kiyoharu Aizawa [Invited Paper] Multimedia FoodLog: Diverse Applications from Self-Monitoring to Social Contributions , 2013 .

[11]  Ramesh C. Jain,et al.  Health Multimedia: Lifestyle Recommendations Based on Diverse Observations , 2017, ICMR.

[12]  Weilin Huang,et al.  CurriculumNet: Weakly Supervised Learning from Large-Scale Web Images , 2018, ECCV.

[13]  Petia Radeva,et al.  Food Ingredients Recognition Through Multi-label Learning , 2017, ICIAP Workshops.

[14]  Sergio Guadarrama,et al.  Im2Calories: Towards an Automated Mobile Vision Food Diary , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[15]  Luis Herranz,et al.  Modeling Restaurant Context for Food Recognition , 2017, IEEE Transactions on Multimedia.

[16]  Jiong Wang,et al.  Attention-based Pyramid Aggregation Network for Visual Place Recognition , 2018, ACM Multimedia.

[17]  Vinod Vokkarane,et al.  DeepFood: Deep Learning-Based Food Image Recognition for Computer-Aided Dietary Assessment , 2016, ICOST.

[18]  Ramesh C. Jain,et al.  A Survey on Food Computing , 2018, ACM Comput. Surv..

[19]  Keiji Yanai,et al.  FoodCam-256: A Large-scale Real-time Mobile Food RecognitionSystem employing High-Dimensional Features and Compression of Classifier Weights , 2014, ACM Multimedia.

[20]  Chong-Wah Ngo,et al.  Cross-modal Recipe Retrieval with Rich Food Attributes , 2017, ACM Multimedia.

[21]  Chong-Wah Ngo,et al.  Deep-based Ingredient Recognition for Cooking Recipe Retrieval , 2016, ACM Multimedia.

[22]  Kilian Q. Weinberger,et al.  Densely Connected Convolutional Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[23]  Monica Mordonini,et al.  Food Image Recognition Using Very Deep Convolutional Networks , 2016, MADiMa @ ACM Multimedia.

[24]  Gian Luca Foresti,et al.  Wide-Slice Residual Networks for Food Recognition , 2016, 2018 IEEE Winter Conference on Applications of Computer Vision (WACV).

[25]  Alex Graves,et al.  Recurrent Models of Visual Attention , 2014, NIPS.

[26]  Makoto Ogawa,et al.  Food Detection and Recognition Using Convolutional Neural Network , 2014, ACM Multimedia.

[27]  Xiao Liu,et al.  Fully Convolutional Attention Networks for Fine-Grained Recognition , 2016 .

[28]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[29]  Feng Zhou,et al.  Fine-Grained Image Classification by Exploring Bipartite-Graph Labels , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[30]  Tao Mei,et al.  Look Closer to See Better: Recurrent Attention Convolutional Neural Network for Fine-Grained Image Recognition , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[31]  Liang Lin,et al.  Multi-label Image Recognition by Recurrently Discovering Attentional Regions , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[32]  Mei Chen,et al.  Food recognition using statistics of pairwise local features , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[33]  Keiji Yanai,et al.  DeepFoodCam: A DCNN-based Real-time Mobile Food Recognition System , 2016, MADiMa @ ACM Multimedia.

[34]  Amaia Salvador,et al.  Learning Cross-Modal Embeddings for Cooking Recipes and Food Images , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[35]  Houqiang Li,et al.  Local Convolutional Neural Networks for Person Re-Identification , 2018, ACM Multimedia.

[36]  Kiyoharu Aizawa,et al.  Personalized Classifier for Food Image Recognition , 2018, IEEE Transactions on Multimedia.

[37]  Yong Rui,et al.  You Are What You Eat: Exploring Rich Recipe Information for Cross-Region Food Analysis , 2018, IEEE Transactions on Multimedia.

[38]  Shuqiang Jiang,et al.  A Delicious Recipe Analysis Framework for Exploring Multi-Modal Recipes with Various Attributes , 2017, ACM Multimedia.

[39]  Chong-Wah Ngo,et al.  Food Photo Recognition for Dietary Tracking: System and Experiment , 2018, MMM.