Egocentric Image Captioning for Privacy-Preserved Passive Dietary Intake Monitoring

Camera-based passive dietary intake monitoring can continuously capture the eating episodes of a subject, recording rich visual information such as the type and volume of food being consumed, as well as the eating behaviours of the subject. However, no existing method incorporates these visual cues to provide a comprehensive context of dietary intake from passive recording (e.g., whether the subject is sharing food with others, what food the subject is eating, and how much food is left in the bowl). Moreover, privacy is a major concern when egocentric wearable cameras are used for capturing. In this paper, we propose a privacy-preserved secure solution (i.e., egocentric image captioning) for dietary assessment with passive monitoring, which unifies food recognition, volume estimation, and scene understanding. By converting images into rich text descriptions, nutritionists can assess individual dietary intake based on the captions instead of the original images, reducing the risk of privacy leakage from images. To this end, an egocentric dietary image captioning dataset has been built, which consists of in-the-wild images captured by head-worn and chest-worn cameras in field studies in Ghana. A novel transformer-based architecture is designed to caption egocentric dietary images. Comprehensive experiments have been conducted to evaluate the effectiveness of the proposed architecture and to justify its design for egocentric dietary image captioning. To the best of our knowledge, this is the first work that applies image captioning to dietary intake assessment in real-life settings.
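The abstract describes the approach only at a high level (visual features from egocentric images feeding a transformer that decodes a text caption) without implementation detail. Purely as a hedged illustration of that general recipe, and not the paper's actual model, a minimal transformer encoder-decoder captioner over pre-extracted visual region features might look like the PyTorch sketch below; the class name EgoCaptioner, all hyperparameters, and the assumption of 2048-dimensional detector-style region features are invented for this example.

# Illustrative sketch only: NOT the paper's architecture. Names, dimensions,
# vocabulary size, and the use of region features are assumptions.
import torch
import torch.nn as nn

class EgoCaptioner(nn.Module):
    def __init__(self, vocab_size, feat_dim=2048, d_model=512,
                 nhead=8, num_layers=3, max_len=50):
        super().__init__()
        # Project region features from a CNN/detector backbone to model width.
        self.feat_proj = nn.Linear(feat_dim, d_model)
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=num_layers, num_decoder_layers=num_layers,
            batch_first=True)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, feats, tokens):
        # feats: (B, R, feat_dim) region features; tokens: (B, T) caption ids.
        T = tokens.size(1)
        pos = torch.arange(T, device=tokens.device).unsqueeze(0)
        tgt = self.token_emb(tokens) + self.pos_emb(pos)
        # Causal mask: each caption position attends only to earlier tokens.
        causal = torch.triu(
            torch.full((T, T), float("-inf"), device=tokens.device), diagonal=1)
        h = self.transformer(src=self.feat_proj(feats), tgt=tgt, tgt_mask=causal)
        return self.out(h)  # (B, T, vocab_size) next-token logits

if __name__ == "__main__":
    model = EgoCaptioner(vocab_size=1000)
    feats = torch.randn(2, 36, 2048)          # e.g., 36 detected regions per image
    tokens = torch.randint(0, 1000, (2, 12))  # shifted ground-truth caption tokens
    print(model(feats, tokens).shape)         # torch.Size([2, 12, 1000])

At training time such a model is typically optimized with token-level cross-entropy on shifted captions; at inference the decoder is run autoregressively (e.g., greedy or beam search), again as a generic captioning convention rather than anything this paper specifies.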
