Following Gaze Across Views

Following the gaze of people inside videos is an important signal for understanding people and their actions. In this paper, we present an approach for following gaze across views by predicting where a particular person is looking throughout a scene. We collect VideoGaze, a new dataset that we use as a benchmark for both training and evaluating models. Given one view with a person in it and a second view of the scene, our model estimates a density for gaze location in the second view. A key aspect of our approach is an end-to-end model that solves the following subproblems: saliency, gaze pose, and geometric relationships between views. Although our model is supervised only with gaze, we show that it learns to solve these subproblems without any direct supervision on them. Experiments suggest that our approach follows gaze better than standard baselines and produces plausible results for everyday situations.
