Estimating Head Motion from Egocentric Vision

The recent availability of lightweight, wearable cameras makes it possible to collect video from a "first-person" perspective, capturing the visual world of the wearer in everyday interactive contexts. In this paper, we investigate how to exploit egocentric vision to infer multimodal behaviors of people wearing head-mounted cameras. More specifically, we estimate head (camera) motion from egocentric video, which can in turn be used to infer non-verbal behaviors such as head turns and nodding in multimodal interactions. We propose several approaches based on Convolutional Neural Networks (CNNs) that combine raw images and optical flow fields, learning to distinguish image regions whose optical flow is caused by global ego-motion from regions whose flow is caused by other motion in the scene. Our results suggest that CNNs do not learn useful visual features directly from raw images with end-to-end training; a better approach is to first extract optical flow explicitly and then train CNNs to integrate the optical flow with visual information.
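As a concrete illustration of the flow-plus-appearance idea, below is a minimal PyTorch sketch of a two-stream network that fuses an RGB frame with a precomputed dense optical-flow field and regresses per-frame head rotation. The layer sizes, the 3-DoF (yaw, pitch, roll) output, and names such as `HeadMotionNet` are illustrative assumptions, not the architecture described in the paper.

```python
import torch
import torch.nn as nn

class HeadMotionNet(nn.Module):
    """Two-stream CNN sketch: fuses an RGB frame with a precomputed
    optical-flow field to regress per-frame head (camera) rotation."""

    def __init__(self):
        super().__init__()
        # Appearance stream: raw RGB frame (3 channels).
        self.rgb_stream = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5, stride=2, padding=2),
            nn.BatchNorm2d(32), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(64), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        # Motion stream: dense optical flow (2 channels: dx, dy).
        self.flow_stream = nn.Sequential(
            nn.Conv2d(2, 32, kernel_size=5, stride=2, padding=2),
            nn.BatchNorm2d(32), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(64), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        # Fused features -> 3-DoF head rotation (yaw, pitch, roll); this
        # target parameterization is an assumption for the sketch.
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 + 64, 128), nn.ReLU(),
            nn.Linear(128, 3),
        )

    def forward(self, rgb, flow):
        feats = torch.cat([self.rgb_stream(rgb), self.flow_stream(flow)], dim=1)
        return self.head(feats)


# Example: one 224x224 frame paired with its optical-flow field.
model = HeadMotionNet()
rgb = torch.randn(1, 3, 224, 224)
flow = torch.randn(1, 2, 224, 224)   # e.g., from a classical or learned flow estimator
print(model(rgb, flow).shape)        # torch.Size([1, 3])
```

In practice the flow field would be computed beforehand by a classical or learned optical-flow estimator, and the output could equally be posed as classification over discrete head-movement classes (e.g., turn left/right, nod, still) rather than continuous rotation regression.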
