Grounded Human-Object Interaction Hotspots from Video (Extended Abstract)

Learning how to interact with objects is an important step towards embodied visual intelligence, but existing techniques suffer from heavy supervision or sensing requirements. We propose an approach to learn human-object interaction "hotspots" directly from video. Rather than treat affordances as a manually supervised semantic segmentation task, our approach learns about interactions by watching videos of real human behavior and anticipating afforded actions. Given a novel image or video, our model infers a spatial hotspot map indicating how an object would be manipulated in a potential interaction, even if the object is currently at rest. Through results with both first- and third-person video, we show the value of grounding affordances in real human-object interactions. Not only are our weakly supervised hotspots competitive with strongly supervised affordance methods, but they can also anticipate object interaction for novel object categories. Project page: this http URL
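To make the task concrete, the sketch below shows one plausible way such a hotspot map could be produced for a static image: a CNN backbone scores the afforded actions, and a gradient-weighted activation map (Grad-CAM style) localizes the image regions that drive each action's score. This is a minimal illustration assuming a gradient-based activation-mapping approach, not the authors' released implementation; all module and variable names (e.g., HotspotSketch, hotspot_map) are hypothetical.

```python
# Minimal sketch of deriving an interaction "hotspot" map from an action
# anticipation classifier via gradient-weighted activations (Grad-CAM style).
# Hypothetical names; not the paper's released code.
import torch
import torch.nn.functional as F
from torchvision import models

class HotspotSketch(torch.nn.Module):
    def __init__(self, num_actions: int):
        super().__init__()
        resnet = models.resnet18(weights=None)                  # backbone; any CNN works
        self.features = torch.nn.Sequential(*list(resnet.children())[:-2])
        self.classifier = torch.nn.Linear(512, num_actions)     # afforded-action scores

    def forward(self, image: torch.Tensor):
        fmap = self.features(image)                             # (B, 512, H', W') spatial features
        pooled = F.adaptive_avg_pool2d(fmap, 1).flatten(1)      # global average pooling
        scores = self.classifier(pooled)                        # (B, num_actions)
        return scores, fmap

def hotspot_map(model: HotspotSketch, image: torch.Tensor, action_idx: int):
    """Gradient-weighted activation map for one afforded action."""
    model.eval()
    image = image.requires_grad_(True)
    scores, fmap = model(image)
    fmap.retain_grad()                                          # keep gradients on the feature map
    scores[:, action_idx].sum().backward()                      # gradient of the action score
    weights = fmap.grad.mean(dim=(2, 3), keepdim=True)          # per-channel importance
    cam = F.relu((weights * fmap).sum(dim=1, keepdim=True))     # (B, 1, H', W')
    cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear", align_corners=False)
    return cam / (cam.amax(dim=(2, 3), keepdim=True) + 1e-8)    # normalized to [0, 1]

# Usage: one hotspot map per candidate action for a 224x224 RGB image.
model = HotspotSketch(num_actions=20)
img = torch.randn(1, 3, 224, 224)
heat = hotspot_map(model, img, action_idx=3)                    # (1, 1, 224, 224)
```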
