Multi-Scale Multi-View Deep Feature Aggregation for Food Recognition

Recently, food recognition has received more and more attention in image processing and computer vision for its great potential applications in human health. Most of the existing methods directly extracted deep visual features via convolutional neural networks (CNNs) for food recognition. Such methods ignore the characteristics of food images and are, thus, hard to achieve optimal recognition performance. In contrast to general object recognition, food images typically do not exhibit distinctive spatial arrangement and common semantic patterns. In this paper, we propose a multi-scale multi-view feature aggregation (MSMVFA) scheme for food recognition. MSMVFA can aggregate high-level semantic features, mid-level attribute features, and deep visual features into a unified representation. These three types of features describe the food image from different granularity. Therefore, the aggregated features can capture the semantics of food images with the greatest probability. For that solution, we utilize additional ingredient knowledge to obtain mid-level attribute representation via ingredient-supervised CNNs. High-level semantic features and deep visual features are extracted from class-supervised CNNs. Considering food images do not exhibit distinctive spatial layout in many cases, MSMVFA fuses multi-scale CNN activations for each type of features to make aggregated features more discriminative and invariable to geometrical deformation. Finally, the aggregated features are more robust, comprehensive, and discriminative via two-level fusion, namely multi-scale fusion for each type of features and multi-view aggregation for different types of features. In addition, MSMVFA is general and different deep networks can be easily applied into this scheme. Extensive experiments and evaluations demonstrate that our method achieves state-of-the-art recognition performance on three popular large-scale food benchmark datasets in Top-1 recognition accuracy. Furthermore, we expect this paper will further the agenda of food recognition in the community of image processing and computer vision.

[1]  Pietro Perona,et al.  Bird Species Categorization Using Pose Normalized Deep Convolutional Nets , 2014, ArXiv.

[2]  Shuqiang Jiang,et al.  A Delicious Recipe Analysis Framework for Exploring Multi-Modal Recipes with Various Attributes , 2017, ACM Multimedia.

[3]  Yong Rui,et al.  You Are What You Eat: Exploring Rich Recipe Information for Cross-Region Food Analysis , 2018, IEEE Transactions on Multimedia.

[4]  Luis Herranz,et al.  Scene Recognition with CNNs: Objects, Scales and Dataset Bias , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[5]  Ramesh C. Jain,et al.  A Survey on Food Computing , 2018, ACM Comput. Surv..

[6]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[7]  Rose Qingyang Hu,et al.  Sensor-Based Human Activity Recognition for Smart Healthcare: A Semi-supervised Machine Learning , 2019, Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering.

[8]  Svetlana Lazebnik,et al.  Multi-scale Orderless Pooling of Deep Convolutional Activation Features , 2014, ECCV.

[9]  Beatriz Remeseiro,et al.  Grab, Pay, and Eat: Semantic Food Detection for Smart Restaurants , 2018, IEEE Transactions on Multimedia.

[10]  Long Bao Le,et al.  Wireless Communications and Mobile Computing 1 Joint Data Compression and Mac Protocol Design for Smartgrids with Renewable Energy , 2022 .

[11]  Neel Joshi,et al.  Menu-Match: Restaurant-Specific Food Logging from Images , 2015, 2015 IEEE Winter Conference on Applications of Computer Vision.

[12]  Chong-Wah Ngo,et al.  Deep-based Ingredient Recognition for Cooking Recipe Retrieval , 2016, ACM Multimedia.

[13]  Kilian Q. Weinberger,et al.  Densely Connected Convolutional Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[14]  Monica Mordonini,et al.  Food Image Recognition Using Very Deep Convolutional Networks , 2016, MADiMa @ ACM Multimedia.

[15]  Dumitru Erhan,et al.  Going deeper with convolutions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[16]  Trevor Darrell,et al.  PANDA: Pose Aligned Networks for Deep Attribute Modeling , 2013, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[17]  Xin Chen,et al.  ChineseFoodNet: A large-scale Image Dataset for Chinese Food Recognition , 2017, ArXiv.

[18]  Keiji Yanai,et al.  Food image recognition using deep convolutional network with pre-training and fine-tuning , 2015, 2015 IEEE International Conference on Multimedia & Expo Workshops (ICMEW).

[19]  Xinhang Song,et al.  Multi-Scale Multi-Feature Context Modeling for Scene Recognition in the Semantic Manifold , 2017, IEEE Transactions on Image Processing.

[20]  Matthieu Guillaumin,et al.  Food-101 - Mining Discriminative Components with Random Forests , 2014, ECCV.

[21]  Weilin Huang,et al.  CurriculumNet: Weakly Supervised Learning from Large-Scale Web Images , 2018, ECCV.

[22]  Petia Radeva,et al.  Food Ingredients Recognition Through Multi-label Learning , 2017, ICIAP Workshops.

[23]  Shuang Wang,et al.  Geolocalized Modeling for Dish Recognition , 2015, IEEE Transactions on Multimedia.

[24]  Luis Herranz,et al.  Being a Supercook: Joint Food Attributes and Multimodal Content Modeling for Recipe Retrieval and Exploration , 2017, IEEE Transactions on Multimedia.

[25]  Giovanni Maria Farinella,et al.  Classifying food images represented as Bag of Textons , 2014, 2014 IEEE International Conference on Image Processing (ICIP).

[26]  Gian Luca Foresti,et al.  Wide-Slice Residual Networks for Food Recognition , 2016, 2018 IEEE Winter Conference on Applications of Computer Vision (WACV).

[27]  Nikos Komodakis,et al.  Wide Residual Networks , 2016, BMVC.

[28]  Xiaogang Wang,et al.  Deep Learning Face Attributes in the Wild , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).

[29]  Vinod Vokkarane,et al.  DeepFood: Deep Learning-Based Food Image Recognition for Computer-Aided Dietary Assessment , 2016, ICOST.

[30]  Mei Chen,et al.  Food recognition using statistics of pairwise local features , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[31]  Amaia Salvador,et al.  Learning Cross-Modal Embeddings for Cooking Recipes and Food Images , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[32]  Michael Elad,et al.  Multi-Scale Patch-Based Image Restoration , 2016, IEEE Transactions on Image Processing.

[33]  Rose Qingyang Hu,et al.  Mobility-Aware Edge Caching and Computing in Vehicle Networks: A Deep Reinforcement Learning , 2018, IEEE Transactions on Vehicular Technology.

[34]  Bolei Zhou,et al.  Places: A 10 Million Image Database for Scene Recognition , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[35]  A. Emin Orhan,et al.  A Simple Cache Model for Image Recognition , 2018, NeurIPS.

[36]  Yuxin Peng,et al.  Object-Part Attention Model for Fine-Grained Image Classification , 2017, IEEE Transactions on Image Processing.

[37]  Sergio Guadarrama,et al.  Im2Calories: Towards an Automated Mobile Vision Food Diary , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[38]  Matthijs C. Dorst Distinctive Image Features from Scale-Invariant Keypoints , 2011 .

[39]  Tao Mei,et al.  Learning Multi-attention Convolutional Neural Network for Fine-Grained Image Recognition , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[40]  Xiangyang Li,et al.  Where and What to Eat: Simultaneous Restaurant and Dish Recognition from Food Image , 2016, PCM.

[41]  Gregory D. Abowd,et al.  Leveraging Context to Support Automated Food Recognition in Restaurants , 2015, 2015 IEEE Winter Conference on Applications of Computer Vision.

[42]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[43]  Chong-Wah Ngo,et al.  Cross-Modal Recipe Retrieval: How to Cook this Dish? , 2017, MMM.

[44]  Trevor Darrell,et al.  Caffe: Convolutional Architecture for Fast Feature Embedding , 2014, ACM Multimedia.

[45]  Gian Luca Foresti,et al.  A Structured Committee for Food Recognition , 2015, 2015 IEEE International Conference on Computer Vision Workshop (ICCVW).

[46]  Yuxin Peng,et al.  StackDRL: Stacked Deep Reinforcement Learning for Fine-grained Visual Categorization , 2018, IJCAI.

[47]  Xiao Liu,et al.  Fully Convolutional Attention Networks for Fine-Grained Recognition , 2016 .

[48]  Feng Zhou,et al.  Fine-Grained Image Classification by Exploring Bipartite-Graph Labels , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[49]  Nitish Srivastava,et al.  Multimodal learning with deep Boltzmann machines , 2012, J. Mach. Learn. Res..

[50]  Tao Mei,et al.  Look Closer to See Better: Recurrent Attention Convolutional Neural Network for Fine-Grained Image Recognition , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[51]  Yizhou Yu,et al.  Harvesting Discriminative Meta Objects with Deep CNN Features for Scene Classification , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[52]  Makoto Ogawa,et al.  Food Detection and Recognition Using Convolutional Neural Network , 2014, ACM Multimedia.