Hand-Priming in Object Localization for Assistive Egocentric Vision

Egocentric vision holds great promise for increasing access to visual information and improving the quality of life for people with visual impairments, for whom object recognition is a daily challenge. While recognition performance continues to improve, it remains difficult to identify which object the user is interested in; the object may not even appear in the frame, since aiming a camera without visual feedback is hard. Moreover, gaze information, commonly used to infer the area of interest in egocentric vision, is often unreliable. However, blind users tend to include their hand in the frame, either interacting with the object they wish to recognize or simply holding it nearby to aid camera aiming. We propose localization models that leverage the presence of the hand as contextual information for priming the center area of the object of interest. In our approach, a hand-segmentation map is fed either to the entire localization network or only to its last convolutional layers. Using egocentric datasets collected from sighted and blind individuals, we show that hand-priming achieves higher precision than other approaches that also encode hand-object interactions in localization, such as fine-tuning, multi-class, and multi-task learning.
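The two priming variants lend themselves to a short sketch. Below is a minimal, hypothetical PyTorch illustration of feeding a hand-segmentation map either to the entire network (concatenated with the RGB frame at the input) or only to its last convolutional layers (concatenated with the intermediate feature maps). The class name `HandPrimedLocalizer`, the layer widths, and the heatmap output head are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class HandPrimedLocalizer(nn.Module):
    """Hypothetical sketch of hand-priming for object localization.

    A 1-channel hand-segmentation map primes either the whole network
    (concatenated with the RGB input) or only the last convolutional
    layers (concatenated with the downsampled feature maps).
    """

    def __init__(self, prime_at: str = "input"):
        super().__init__()
        assert prime_at in ("input", "last_conv")
        self.prime_at = prime_at
        in_ch = 4 if prime_at == "input" else 3  # RGB + mask if primed early
        self.early = nn.Sequential(               # downsamples by 8x
            nn.Conv2d(in_ch, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        late_in = 256 + (1 if prime_at == "last_conv" else 0)
        self.late = nn.Sequential(
            nn.Conv2d(late_in, 256, 3, padding=1), nn.ReLU(),
            nn.Conv2d(256, 1, 1),  # per-pixel objectness logit
        )

    def forward(self, rgb: torch.Tensor, hand_mask: torch.Tensor) -> torch.Tensor:
        # rgb: (B, 3, H, W); hand_mask: (B, 1, H, W) in [0, 1]
        if self.prime_at == "input":
            feats = self.early(torch.cat([rgb, hand_mask], dim=1))
        else:
            feats = self.early(rgb)
            mask = nn.functional.interpolate(hand_mask, size=feats.shape[-2:])
            feats = torch.cat([feats, mask], dim=1)
        return self.late(feats)  # coarse heatmap over object-of-interest centers

# Usage: prime only the last convolutional layers with the hand mask.
model = HandPrimedLocalizer(prime_at="last_conv")
heatmap = model(torch.randn(1, 3, 224, 224), torch.rand(1, 1, 224, 224))
```

Concatenating the mask at the input lets every layer condition on the hand, while injecting it before the last convolutional layers keeps the early feature extractor unchanged, so a pretrained backbone could be reused; the abstract's comparison suggests both placements are worth evaluating.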
