Predicting Important Objects for Egocentric Video Summarization

We present a video summarization approach for egocentric, or "wearable," camera data. Given hours of video, the proposed method produces a compact storyboard summary of the camera wearer's day. In contrast to traditional keyframe selection techniques, the resulting summary focuses on the most important objects and people with which the camera wearer interacts. To accomplish this, we develop region cues indicative of high-level saliency in egocentric video, such as nearness to hands, gaze, and frequency of occurrence, and learn a regressor to predict the relative importance of any new region from these cues. Using these predictions and a simple form of temporal event detection, our method selects frames for the storyboard that reflect the key object-driven happenings. We adjust the compactness of the final summary given either an importance selection criterion or a length budget; for the latter, we design an efficient dynamic programming solution that accounts for importance, visual uniqueness, and temporal displacement. Critically, the approach is neither camera-wearer-specific nor object-specific: the learned importance metric need not be retrained for a given user or context, and it can predict the importance of objects and people that have never been seen previously. Our results on two egocentric video datasets show the method's promise relative to existing techniques for saliency and summarization.
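The length-budget selection described above can be sketched as a small dynamic program that trades off predicted importance, visual uniqueness, and temporal displacement. The sketch below is illustrative only, under assumed inputs; the function name `summarize`, the weights `lam` and `gamma`, and the Euclidean dissimilarity term are hypothetical stand-ins, not the paper's exact objective.

```python
import numpy as np

def summarize(importance, features, k, lam=1.0, gamma=0.1):
    """Pick k frame indices for a storyboard via dynamic programming.

    Illustrative objective (an assumption, not the paper's formulation):
    sum of per-frame predicted importance, plus lam * visual dissimilarity
    between consecutive selected frames (uniqueness), plus gamma * temporal
    gap between them (displacement).

    importance: (n,) predicted importance score per frame
    features:   (n, d) visual descriptor per frame
    """
    n = len(importance)
    assert 1 <= k <= n
    NEG = -np.inf
    # dp[j, i]: best score of a (j+1)-frame summary ending at frame i
    dp = np.full((k, n), NEG)
    back = np.full((k, n), -1, dtype=int)
    dp[0, :] = importance
    for j in range(1, k):
        for i in range(j, n):  # need j earlier frames before i
            best, arg = NEG, -1
            for p in range(j - 1, i):
                if dp[j - 1, p] == NEG:
                    continue
                dissim = np.linalg.norm(features[i] - features[p])
                s = dp[j - 1, p] + importance[i] + lam * dissim + gamma * (i - p)
                if s > best:
                    best, arg = s, p
            dp[j, i], back[j, i] = best, arg
    # Backtrack from the best endpoint of a k-frame summary.
    i = int(np.argmax(dp[k - 1]))
    sel = [i]
    for j in range(k - 1, 0, -1):
        i = back[j, i]
        sel.append(i)
    return sel[::-1]
```

Here the transition score favors frames that look different from, and occur well after, the previous pick; the full method additionally uses temporal event detection to structure the summary, which this sketch omits.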

[1]  Wayne H. Wolf,et al.  Key frame selection by motion analysis , 1996, 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings.

[2]  Stephen W. Smoliar,et al.  An integrated system for content-based video retrieval and browsing , 1997, Pattern Recognit..

[3]  Steve Mann,et al.  'WearCam' (The wearable camera): personal imaging systems for long-term use in wearable tetherless computer-mediated reality and personal photo/videographic memory prosthesis , 1998, Digest of Papers. Second International Symposium on Wearable Computers (Cat. No.98EX215).

[4]  Pietro Perona,et al.  A Factorization Approach to Grouping , 1998, ECCV.

[5]  Jennifer Healey,et al.  StartleCam: a cybernetic wearable camera , 1998, Digest of Papers. Second International Symposium on Wearable Computers (Cat. No.98EX215).

[6]  Alex Pentland,et al.  Visual contextual awareness in wearable computing , 1998, Digest of Papers. Second International Symposium on Wearable Computers (Cat. No.98EX215).

[7]  Alex Pentland,et al.  Real-Time American Sign Language Recognition Using Desk and Wearable Computer Based Video , 1998, IEEE Trans. Pattern Anal. Mach. Intell..

[8]  Alex Pentland,et al.  Unsupervised clustering of ambulatory audio and video , 1999, 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings. ICASSP99 (Cat. No.99CH36258).

[9]  Paul A. Viola,et al.  Rapid object detection using a boosted cascade of simple features , 2001, Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. CVPR 2001.

[10]  Kiyoharu Aizawa,et al.  Summarization of wearable videos using support vector machine , 2002, Proceedings. IEEE International Conference on Multimedia and Expo.

[11]  John R. Kender,et al.  Optimization Algorithms for the Selection of Key Frame Sequences of Variable Length , 2002, ECCV.

[12]  John R. Kender,et al.  Video Summaries through Mosaic-Based Shot and Scene Clustering , 2002, ECCV.

[13]  Mathias Kölsch,et al.  Robust hand detection , 2004, Sixth IEEE International Conference on Automatic Face and Gesture Recognition, 2004. Proceedings..

[14]  James M. Rehg,et al.  Statistical Color Models with Application to Skin Detection , 2004, International Journal of Computer Vision.

[15]  David Salesin,et al.  Schematic storyboarding for video visualization and editing , 2006, SIGGRAPH 2006.

[16]  Yael Pritch,et al.  Making a Long Video Short: Dynamic Video Synopsis , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[17]  Wei-Hao Lin,et al.  Structuring continuous video recordings of everyday life using time-constrained clustering , 2006, Electronic Imaging.

[18]  Christof Koch,et al.  Modeling attention to salient proto-objects , 2006, Neural Networks.

[19]  Yasuyuki Matsushita,et al.  Dynamic stills and clip trailers , 2006, The Visual Computer.

[20]  Shahram Izadi,et al.  SenseCam: A Retrospective Memory Aid , 2006, UbiComp.

[21]  Yael Pritch,et al.  Webcam Synopsis: Peeking Around the World , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[22]  Nuno Vasconcelos,et al.  The discriminant center-surround hypothesis for bottom-up saliency , 2007, NIPS.

[23]  Anind K. Dey,et al.  Providing good memory cues for people with episodic memory impairment , 2007, Assets '07.

[24]  Nanning Zheng,et al.  Learning to Detect a Salient Object , 2011, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[25]  Denis Simakov,et al.  Summarizing visual data using bidirectional similarity , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[26]  Harry W. Agius,et al.  Video summarisation: A conceptual framework and survey of the state of the art , 2008, J. Vis. Commun. Image Represent..

[27]  Pietro Perona,et al.  Some Objects Are More Equal Than Others: Measuring and Predicting Importance , 2008, ECCV.

[28]  Alan F. Smeaton,et al.  Investigating keyframe selection methods in the novel domain of passively captured visual lifelogs , 2008, CIVR '08.

[29]  Bernt Schiele,et al.  Discovery of activity patterns using topic models , 2008 .

[30]  A.F. Smeaton,et al.  Combining Face Detection and Novelty to Identify Important Events in a Visual Lifelog , 2008, 2008 IEEE 8th International Conference on Computer and Information Technology Workshops.

[31]  Martial Hebert,et al.  Temporal segmentation and activity classification from first-person sensing , 2009, 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops.

[32]  Christof Koch,et al.  A Model of Saliency-Based Visual Attention for Rapid Scene Analysis , 2009 .

[33]  Bernard Mérialdo,et al.  Multi-document video summarization , 2009, 2009 IEEE International Conference on Multimedia and Expo.

[34]  Xiaofeng Ren,et al.  Figure-ground segmentation improves handled object recognition in egocentric video , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[35]  Thomas Deselaers,et al.  What is an object? , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[36]  Cristian Sminchisescu,et al.  Constrained parametric min-cuts for automatic object segmentation , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[37]  Kristen Grauman,et al.  Accounting for the Relative Importance of Objects in Image Retrieval , 2010, BMVC.

[38]  Nebojsa Jojic,et al.  Structural epitome: a way to summarize one's visual experience , 2010, NIPS.

[39]  Gang Hua,et al.  A Hierarchical Visual Model for Video Object Summarization , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[40]  Derek Hoiem,et al.  Category Independent Object Proposals , 2010, ECCV.

[41]  Stefan Carlsson,et al.  Novelty detection from an ego-centric perspective , 2011, CVPR 2011.

[42]  Takahiro Okabe,et al.  Fast unsupervised ego-action learning for first-person sports videos , 2011, CVPR 2011.

[43]  Yong Jae Lee,et al.  Key-segments for video object segmentation , 2011, 2011 International Conference on Computer Vision.

[44]  Matthijs C. Dorst Distinctive Image Features from Scale-Invariant Keypoints , 2011 .

[45]  Ali Farhadi,et al.  Understanding egocentric activities , 2011, 2011 International Conference on Computer Vision.

[46]  Steve Hodges,et al.  SenseCam: A wearable camera that stimulates and rehabilitates autobiographical memory , 2011, Memory.

[47]  Jitendra Malik,et al.  Large Displacement Optical Flow: Descriptor Matching in Variational Motion Estimation , 2011, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[48]  James M. Rehg,et al.  Social interactions: A first-person perspective , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[49]  Yong Jae Lee,et al.  Discovering important people and objects for egocentric video summarization , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[50]  Deva Ramanan,et al.  Detecting activities of daily living in first-person camera views , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[51]  Cheng Li,et al.  Pixel-Level Hand Detection in Ego-centric Videos , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[52]  Kristen Grauman,et al.  Story-Driven Summarization for Egocentric Video , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[53]  Larry H. Matthies,et al.  First-Person Activity Recognition: What Are They Doing to Me? , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[54]  James M. Rehg,et al.  Learning to Predict Gaze in Egocentric Video , 2013, 2013 IEEE International Conference on Computer Vision.

[55]  Philip H. S. Torr,et al.  BING: Binarized normed gradients for objectness estimation at 300fps , 2014, Computational Visual Media.

[56]  智一 吉田,et al.  Efficient Graph-Based Image Segmentationを用いた圃場図自動作成手法の検討 , 2014 .