Detecting Engagement in Egocentric Video

In a wearable camera video, we see what the camera wearer sees. While this makes it easy to know roughly what he chose to look at, it does not immediately reveal when he was engaged with his surroundings. Specifically, at what moments did his focus linger, as he paused to gather more information about something he saw? Knowing this answer would benefit various applications in video summarization and augmented reality, yet prior work focuses solely on the “what” question (estimating saliency, gaze) without considering the “when” (engagement). We propose a learning-based approach that uses long-term egomotion cues to detect engagement, specifically in browsing scenarios where one frequently takes in new visual information (e.g., shopping, touring). We introduce a large, richly annotated dataset for ego-engagement that is the first of its kind. Our approach outperforms a wide array of existing methods. We show engagement can be detected well independent of both scene appearance and the camera wearer’s identity.
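To make the idea concrete, the sketch below shows one way frame-level engagement could be scored from long-term egomotion cues: camera motion is summarized per frame from dense optical flow, pooled over a long temporal window, and fed to a standard classifier. This is a minimal illustration, not the paper's exact pipeline; the flow estimator, feature choices, window radius, and random-forest classifier are assumptions made for the example.

```python
# Sketch: frame-level engagement detection from long-term egomotion cues.
# Assumptions (not from the paper): Farneback optical flow as the egomotion
# proxy, mean/std + direction-histogram features, a +/-15-sample pooling
# window, and a random forest as the learner.
import cv2
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def egomotion_features(video_path, step=5):
    """Per-sampled-frame egomotion descriptors from dense optical flow."""
    cap = cv2.VideoCapture(video_path)
    ok, prev = cap.read()
    prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
    feats, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        idx += 1
        if idx % step:
            continue
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])
        # Summarize camera motion: magnitude stats + magnitude-weighted
        # direction histogram (8 bins over [0, 2*pi)).
        hist, _ = np.histogram(ang, bins=8, range=(0, 2 * np.pi), weights=mag)
        feats.append(np.concatenate([[mag.mean(), mag.std()],
                                     hist / (hist.sum() + 1e-6)]))
        prev_gray = gray
    cap.release()
    return np.array(feats)


def pool_long_term(feats, radius=15):
    """Concatenate mean/std of per-frame features over a long temporal window."""
    pooled = []
    for t in range(len(feats)):
        w = feats[max(0, t - radius): t + radius + 1]
        pooled.append(np.concatenate([w.mean(axis=0), w.std(axis=0)]))
    return np.array(pooled)


# Usage (hypothetical labels): X = pool_long_term(egomotion_features(path)),
# y = per-frame 0/1 engagement annotations aligned with the sampled frames.
# clf = RandomForestClassifier(n_estimators=200).fit(X_train, y_train)
# scores = clf.predict_proba(X_test)[:, 1]   # per-frame engagement score
```

The long pooling window is the key design point: a momentary head turn and a lingering inspection can look similar over a few frames, so the classifier is given motion statistics aggregated over several seconds.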
