A Large-Scale Benchmark for Food Image Segmentation

Food image segmentation is a critical and indispensable task for developing health-related applications such as estimating food calories and nutrients. Existing food image segmentation models underperform for two reasons: (1) there is a lack of high-quality food image datasets with fine-grained ingredient labels and pixel-wise location masks---existing datasets either carry coarse ingredient labels or are small in size; and (2) the complex appearance of food makes it difficult to localize and recognize ingredients in food images, e.g., ingredients may overlap one another in the same image, and the same ingredient may look very different across food images. In this work, we build a new food image dataset, FoodSeg103 (and its extension FoodSeg154), containing 9,490 images. We annotate these images with 154 ingredient classes, and each image has an average of 6 ingredient labels and pixel-wise masks. In addition, we propose a multi-modality pre-training approach called ReLeM that explicitly equips a segmentation model with rich semantic food knowledge. In experiments, we use three popular semantic segmentation methods (i.e., dilated convolution based [20], feature pyramid based [25], and Vision Transformer based [60]) as baselines, and evaluate them, as well as ReLeM, on our new datasets. We believe that FoodSeg103 (and its extension FoodSeg154) and the models pre-trained with ReLeM can serve as a benchmark to facilitate future work on fine-grained food image understanding. We make all datasets and methods publicly available at https://xiongweiwu.github.io/foodseg103.html.
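To make the annotation structure concrete, below is a minimal PyTorch sketch of how a FoodSeg103-style sample (an RGB image paired with a pixel-wise ingredient mask over up to 154 classes) might be loaded for a semantic segmentation baseline. The directory layout, file naming, and mask encoding (class IDs stored as PNG pixel values, with 0 as background) are illustrative assumptions, not the released data format.

```python
# Minimal sketch of loading FoodSeg103-style samples for semantic segmentation.
# Assumptions (for illustration only): each RGB image has a same-named PNG mask
# whose pixel values are ingredient class IDs in [0, num_classes), 0 = background.
from pathlib import Path

import numpy as np
import torch
from PIL import Image
from torch.utils.data import Dataset


class FoodSegDataset(Dataset):
    """Pairs food images with pixel-wise ingredient masks."""

    def __init__(self, image_dir: str, mask_dir: str, num_classes: int = 154):
        self.image_paths = sorted(Path(image_dir).glob("*.jpg"))
        self.mask_dir = Path(mask_dir)
        self.num_classes = num_classes

    def __len__(self) -> int:
        return len(self.image_paths)

    def __getitem__(self, idx: int):
        image_path = self.image_paths[idx]
        mask_path = self.mask_dir / (image_path.stem + ".png")

        # HWC uint8 image -> CHW float tensor in [0, 1].
        image = np.array(Image.open(image_path).convert("RGB"), dtype=np.float32) / 255.0
        image = torch.from_numpy(image).permute(2, 0, 1)

        # Single-channel mask of per-pixel ingredient class IDs.
        mask = torch.from_numpy(np.array(Image.open(mask_path), dtype=np.int64))
        return image, mask
```

A dataset of this shape can be wrapped in a standard DataLoader and used to train any of the three baseline families evaluated in the paper (dilated convolution, feature pyramid, or Vision Transformer based segmenters).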

[1] Li Fei-Fei, et al. ImageNet: A large-scale hierarchical image database, 2009, CVPR.

[2] Jian Sun, et al. Deep Residual Learning for Image Recognition, 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[3] Trevor Darrell, et al. Fully Convolutional Networks for Semantic Segmentation, 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[4] Shuang Wang, et al. Geolocalized Modeling for Dish Recognition, 2015, IEEE Transactions on Multimedia.

[5] Hedy Kober, et al. Training in cognitive strategies reduces eating and improves food choice, 2018, Proceedings of the National Academy of Sciences.

[6] Kaiming He, et al. Feature Pyramid Networks for Object Detection, 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[7] Paolo Napoletano, et al. Learning CNN-based Features for Retrieval of Food Images, 2017, ICIAP Workshops.

[8] Jeffrey Dean, et al. Efficient Estimation of Word Representations in Vector Space, 2013, ICLR.

[9] Yunchao Wei, et al. CCNet: Criss-Cross Attention for Semantic Segmentation, 2018, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[10] Arjun Karpur, et al. Nutrition5k: Towards Automatic Nutritional Understanding of Generic Food, 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[11] Feng Zhou, et al. Fine-Grained Image Classification by Exploring Bipartite-Graph Labels, 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[12] Tat-Seng Chua, et al. Mixed-dish Recognition with Contextual Relation Networks, 2019, ACM Multimedia.

[13] Zhiling Wang, et al. ISIA Food-500: A Dataset for Large-Scale Food Recognition via Stacked Global-Local Attention Network, 2020, ACM Multimedia.

[14] Xiaogang Wang, et al. Pyramid Scene Parsing Network, 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[15] Kaiming He, et al. Panoptic Feature Pyramid Networks, 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[16] Lei Yang, et al. PFID: Pittsburgh fast-food image dataset, 2009, 2009 16th IEEE International Conference on Image Processing (ICIP).

[17] Chong-Wah Ngo, et al. Mixed Dish Recognition through Multi-Label Learning, 2019, CEA@ICMR.

[18] Giovanni Maria Farinella, et al. A Benchmark Dataset to Study the Representation of Food Images, 2014, ECCV Workshops.

[19] Keiji Yanai, et al. UEC-FoodPix Complete: A Large-Scale Food Image Segmentation Dataset, 2020, ICPR Workshops.

[20] Shuqiang Jiang, et al. Ingredient-Guided Cascaded Multi-Attention Network for Food Recognition, 2019, ACM Multimedia.

[21] Geoffrey E. Hinton, et al. ImageNet classification with deep convolutional neural networks, 2012, Commun. ACM.

[22] Lukasz Kaiser, et al. Attention is All you Need, 2017, NIPS.

[23] Keiji Yanai, et al. Multiple-food recognition considering co-occurrence employing manifold ranking, 2012, Proceedings of the 21st International Conference on Pattern Recognition (ICPR2012).

[24] Alexei A. Efros, et al. Unbiased look at dataset bias, 2011, CVPR 2011.

[25] Keiji Yanai, et al. Automatic Expansion of a Food Image Dataset Leveraging Existing Categories with Domain Adaptation, 2014, ECCV Workshops.

[26] Philips Kokoh Prasetyo, et al. RecipeGPT: Generative Pre-training Based Cooking Recipe Generation and Evaluation System, 2020, WWW.

[27] Kaiming He, et al. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[28] Amaia Salvador, et al. Learning Cross-Modal Embeddings for Cooking Recipes and Food Images, 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[29] Antonio Torralba, et al. Recipe1M+: A Dataset for Learning Cross-Modal Embeddings for Cooking Recipes and Food Images, 2021, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[30] Amaia Salvador, et al. Inverse Cooking: Recipe Generation From Food Images, 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[31] Abhinav Gupta, et al. Non-local Neural Networks, 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[32] D. Tilman, et al. Global diets link environmental sustainability and human health, 2014, Nature.

[33] Hanwang Zhang, et al. Self-Regulation for Semantic Segmentation, 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[34] Ling Shao, et al. Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions, 2021, ArXiv.

[35] Tao Xiang, et al. Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers, 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[36] Wataru Shimoda, et al. Learning Food Image Similarity for Food Image Retrieval, 2017, 2017 IEEE Third International Conference on Multimedia Big Data (BigMM).

[37] Zhenguang Liu, et al. Combining Graph Neural Networks With Expert Knowledge for Smart Contract Vulnerability Detection, 2021, IEEE Transactions on Knowledge and Data Engineering.

[38] Steven C. H. Hoi, et al. FoodAI: Food Image Recognition via Deep Learning for Smart Food Logging, 2019, KDD.

[39] Matthieu Guillaumin, et al. Food-101 - Mining Discriminative Components with Random Forests, 2014, ECCV.

[40] Chunyan Miao, et al. Structure-Aware Generation Network for Recipe Generation from Images, 2020, ECCV.

[41] Sergio Guadarrama, et al. Im2Calories: Towards an Automated Mobile Vision Food Diary, 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[42] Marios Anthimopoulos, et al. A Food Recognition System for Diabetic Patients Based on an Optimized Bag-of-Features Model, 2014, IEEE Journal of Biomedical and Health Informatics.

[43] John R. Smith, et al. Snap, Eat, RepEat: A Food Recognition Engine for Dietary Logging, 2016, MADiMa @ ACM Multimedia.

[44] Siyao Wang, et al. Mining Discriminative Food Regions for Accurate Food Recognition, 2019, BMVC.

[45] Stephen Lin, et al. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows, 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[46] Chong-Wah Ngo, et al. Deep-based Ingredient Recognition for Cooking Recipe Retrieval, 2016, ACM Multimedia.

[47] Keiji Yanai, et al. Image Recognition of 85 Food Categories by Feature Fusion, 2010, 2010 IEEE International Symposium on Multimedia.

[48] Matthieu Cord, et al. Recipe recognition with large multimodal food dataset, 2015, 2015 IEEE International Conference on Multimedia & Expo Workshops (ICMEW).

[49] Ajay Divakaran, et al. FoodX-251: A Dataset for Fine-grained Food Classification, 2019, ArXiv.

[50] Wataru Shimoda, et al. A New Large-scale Food Image Segmentation Dataset and Its Application to Food Calorie Estimation Based on Grains of Rice, 2019, MADiMa @ ACM Multimedia.

[51] Iasonas Kokkinos, et al. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs, 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[52] Iasonas Kokkinos, et al. Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs, 2014, ICLR.

[53] Keiji Yanai, et al. A food image recognition system with Multiple Kernel Learning, 2009, 2009 16th IEEE International Conference on Image Processing (ICIP).

[54] Jinhui Tang, et al. Feature Pyramid Transformer, 2020, ECCV.

[55] Yuning Jiang, et al. Unified Perceptual Parsing for Scene Understanding, 2018, ECCV.

[56] Jimmy Ba, et al. Adam: A Method for Stochastic Optimization, 2014, ICLR.

[57] Wen Tang, et al. MUSEFood: Multi-Sensor-Based Food Volume Estimation on Smartphones, 2019, 2019 IEEE SmartWorld, Ubiquitous Intelligence & Computing, Advanced & Trusted Computing, Scalable Computing & Communications, Cloud & Big Data Computing, Internet of People and Smart City Innovation (SmartWorld/SCALCOM/UIC/ATC/CBDCom/IOP/SCI).

[58] Touradj Ebrahimi, et al. Food/Non-food Image Classification and Food Categorization using Pre-Trained GoogLeNet Model, 2016, MADiMa @ ACM Multimedia.

[59] Georg Heigold, et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, 2021, ICLR.

[60] Sebastian Ramos, et al. The Cityscapes Dataset for Semantic Urban Scene Understanding, 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).