Story-Driven Summarization for Egocentric Video

We present a video summarization approach that discovers the story of an egocentric video. Given a long input video, our method selects a short chain of video subshots depicting the essential events. Inspired by work in text analysis that links news articles over time, we define a random-walk-based metric of influence between subshots that reflects how visual objects contribute to the progression of events. Using this influence metric, we define an objective for the optimal k-subshot summary. Whereas traditional methods optimize a summary's diversity or representativeness, ours explicitly accounts for how one sub-event "leads to" another, which, critically, captures event connectivity beyond simple object co-occurrence. As a result, our summaries provide a better sense of story. We apply our approach to over 12 hours of daily activity video taken from 23 unique camera wearers, and systematically evaluate its quality against multiple baselines with 34 human subjects.
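To make the core idea concrete, the following is a minimal sketch, not the paper's actual formulation: it assumes subshot-to-subshot links are summarized by a hypothetical object co-occurrence matrix, computes a random-walk-with-restart influence score between subshots, and greedily chains high-influence subshots into a k-subshot summary. The function names, the restart parameter, and the greedy selection are all illustrative assumptions.

```python
import numpy as np

def random_walk_influence(cooccur, restart=0.2):
    """Influence of subshot i on subshot j via random walk with restart.

    `cooccur[i, j]` is a hypothetical count of visual objects shared by
    subshots i and j (a stand-in for the paper's subshot/object links).
    Returns M where M[i, j] is the stationary probability that a walk
    restarting at i is found at j.
    """
    n = cooccur.shape[0]
    P = cooccur.astype(float)
    rows = P.sum(axis=1)
    rows[rows == 0] = 1.0                  # guard against isolated subshots
    P = P / rows[:, None]                  # row-stochastic transition matrix
    # Closed form of random walk with restart:
    # M = restart * (I - (1 - restart) * P)^-1
    return restart * np.linalg.inv(np.eye(n) - (1.0 - restart) * P)

def greedy_story_chain(infl, k, start=0):
    """Greedily extend a chain so each step follows a high-influence link
    (a simple heuristic stand-in for the paper's k-subshot objective)."""
    infl = infl.copy()
    np.fill_diagonal(infl, -np.inf)        # disallow self-transitions
    chain = [start]
    for _ in range(k - 1):
        scores = infl[chain[-1]].copy()
        scores[chain] = -np.inf            # no repeated subshots
        chain.append(int(np.argmax(scores)))
    return chain
```

A usage sketch: with four subshots whose co-occurrence counts are known, `greedy_story_chain(random_walk_influence(C), k=3)` returns a length-3 chain in which consecutive subshots share strong object-driven links, which is the "leads to" relation the abstract contrasts with plain diversity-based selection.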
