Predicting Visual Context for Unsupervised Event Segmentation in Continuous Photo-streams

Segmenting video content into events provides semantic structures for indexing, retrieval, and summarization. Since motion cues are not available in continuous photo-streams, and annotations in lifelogging are scarce and costly, frames are usually clustered into events by comparing their visual features in an unsupervised way. However, such methodologies struggle with heterogeneous events, e.g. taking a walk, and with temporary changes in viewing direction, e.g. at a meeting. To address these limitations, we propose Contextual Event Segmentation (CES), a novel segmentation paradigm that uses an LSTM-based generative network to model photo-stream sequences, predict their visual context, and track its evolution. CES decides whether a frame is an event boundary by comparing the visual context generated from past frames with the visual context predicted from future frames. We trained CES on a new, large-scale lifelogging dataset consisting of more than 1.5 million images spanning 1,723 days. Experiments on the popular EDUB-Seg dataset show that our model outperforms the state of the art by over 16% in F-measure. Furthermore, CES's performance is only 3 points below that of human annotators.
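
To make the boundary test concrete, the sketch below illustrates one way the past-vs-future comparison could be realized. It is a minimal illustration under stated assumptions, not the paper's implementation: the frame-feature dimensions, LSTM layer sizes, window length, and the cosine-distance score are all hypothetical choices introduced here.

```python
# Minimal sketch of a CES-style boundary test. All sizes and the scoring
# rule are illustrative assumptions; the paper's architecture may differ.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextPredictor(nn.Module):
    """LSTM that summarizes a short frame sequence into a predicted
    visual-context vector (hypothetical dimensions)."""
    def __init__(self, feat_dim=1024, hidden_dim=512):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.proj = nn.Linear(hidden_dim, feat_dim)

    def forward(self, feats):            # feats: (1, T, feat_dim)
        out, _ = self.lstm(feats)
        return self.proj(out[:, -1])     # context from the final state

def boundary_scores(feats, fwd, bwd, window=5):
    """Score frame t by the disagreement between the context predicted
    from the past (frames t-window..t-1, in order) and the context
    predicted from the future (frames t..t+window-1, reversed)."""
    T = feats.size(0)
    scores = torch.zeros(T)
    with torch.no_grad():
        for t in range(window, T - window + 1):
            past = feats[t - window:t].unsqueeze(0)
            future = feats[t:t + window].flip(0).unsqueeze(0)
            c_past, c_future = fwd(past), bwd(future)
            # High cosine distance => past and future disagree on the
            # visual context, suggesting an event boundary at frame t.
            scores[t] = 1.0 - F.cosine_similarity(c_past, c_future).item()
    return scores

# Usage: per-frame CNN features for a day's photo-stream (random
# stand-ins here), one forward and one backward predictor, and a
# simple peak threshold on the resulting scores.
feats = torch.randn(200, 1024)
fwd, bwd = ContextPredictor(), ContextPredictor()
scores = boundary_scores(feats, fwd, bwd)
boundaries = (scores > scores.mean() + scores.std()).nonzero().flatten()
```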
