Temporal Segmentation of Egocentric Videos to Highlight Personal Locations of Interest

With the increasing availability of wearable cameras, the acquisition of egocentric videos is becoming common in many scenarios. However, the lack of explicit structure in such videos (e.g., video chapters) makes them difficult to exploit. We propose to segment unstructured egocentric videos in order to highlight the presence of personal locations of interest specified by the end user. Given the large variability of the visual content acquired by such devices, it is necessary to design explicit rejection mechanisms that can detect negative frames (i.e., frames not related to any considered location) while learning only from positive samples at training time. To study the problem, we collected a dataset of egocentric videos covering 10 personal locations of interest. We propose a method that segments egocentric videos by discriminating among the personal locations of interest, rejecting negative frames, and enforcing temporal coherence between neighboring predictions.

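The abstract does not detail the implementation, so the following is only a minimal sketch of the three ingredients it names: multi-class discrimination among personal locations, rejection of negative frames by thresholding classifier confidence, and temporal smoothing of neighboring per-frame predictions. The classifier choice (an SVM with probability outputs), the confidence threshold, the window size, and all function names are illustrative assumptions, not the paper's actual pipeline.

```python
# Minimal sketch (assumptions, not the paper's exact method):
# discriminate among personal locations, reject low-confidence
# (negative) frames, and temporally smooth per-frame predictions.
import numpy as np
from sklearn.svm import SVC

REJECT = -1           # label assigned to rejected (negative) frames
CONF_THRESHOLD = 0.6  # assumed confidence cutoff for rejection
WINDOW = 9            # assumed smoothing window (odd number of frames)

def train_location_model(features, labels):
    """Fit a probabilistic multi-class classifier on positive
    (location) frames only; no negative frames are required."""
    clf = SVC(probability=True)  # RBF SVM with Platt-scaled outputs
    clf.fit(features, labels)
    return clf

def classify_with_rejection(clf, features):
    """Predict a location per frame, rejecting frames whose maximum
    class probability falls below the confidence threshold."""
    proba = clf.predict_proba(features)
    preds = clf.classes_[np.argmax(proba, axis=1)]
    preds[np.max(proba, axis=1) < CONF_THRESHOLD] = REJECT
    return preds

def smooth_predictions(preds, window=WINDOW):
    """Enforce temporal coherence with a sliding-window majority
    vote over neighboring per-frame predictions."""
    half = window // 2
    smoothed = preds.copy()
    for t in range(len(preds)):
        lo, hi = max(0, t - half), min(len(preds), t + half + 1)
        vals, counts = np.unique(preds[lo:hi], return_counts=True)
        smoothed[t] = vals[np.argmax(counts)]
    return smoothed

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic per-frame features for 3 locations (stand-ins for
    # real image descriptors such as CNN features).
    X = np.vstack([rng.normal(c, 1.0, (100, 16)) for c in (0, 4, 8)])
    y = np.repeat([0, 1, 2], 100)
    clf = train_location_model(X, y)

    # A test "video": frames from location 1, then unrelated frames;
    # frames far from all training locations should receive low
    # confidence and tend to be rejected.
    video = np.vstack([rng.normal(4, 1.0, (50, 16)),
                       rng.normal(20, 1.0, (50, 16))])
    labels = smooth_predictions(classify_with_rejection(clf, video))
    print(labels)
```

A sliding-window majority vote is just one simple way to enforce temporal coherence between neighboring predictions; a sequential model (e.g., an HMM) would be a natural alternative under the same problem formulation.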