Summarizing Unconstrained Videos Using Salient Montages

We present a novel method to summarize unconstrained videos using salient montages (i.e., a “melange” of frames in the video as shown in <xref ref-type="fig" rid="fig1">Fig. 1</xref>, by finding “montageable moments” and identifying the salient people and actions to depict in each montage. Our method aims at addressing the increasing need for generating concise visualizations from the large number of videos being captured from portable devices. Our main contributions are (1) the process of finding salient people and moments to form a montage, and (2) the application of this method to videos taken “in the wild” where the camera moves freely. As such, we demonstrate results on head-mounted cameras, where the camera moves constantly, as well as on videos downloaded from YouTube. In our experiments, we show that our method can reliably detect and track humans under significant action and camera motion. Moreover, the predicted salient people are more accurate than results from state-of-the-art video salieny method <xref ref-type="bibr" rid="ref1">[1]</xref> . Finally, we demonstrate that a novel “montageability” score can be used to retrieve results with relatively high precision which allows us to present high quality montages to users.

[1]  C. Schmid,et al.  Category-Specific Video Summarization , 2014, ECCV.

[2]  Yael Pritch,et al.  Making a Long Video Short: Dynamic Video Synopsis , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[3]  Charless C. Fowlkes,et al.  Globally-optimal greedy algorithms for tracking a variable number of objects , 2011, CVPR 2011.

[4]  Min Chen,et al.  A Survey on Video-based Graphics and Video Visualization , 2011, Eurographics.

[5]  Nuno Vasconcelos,et al.  Spatiotemporal Saliency in Dynamic Scenes , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[6]  Deva Ramanan,et al.  Efficiently Scaling up Crowdsourced Video Annotation , 2012, International Journal of Computer Vision.

[7]  Antonio Torralba,et al.  SIFT Flow: Dense Correspondence across Different Scenes , 2008, ECCV.

[8]  Yael Pritch,et al.  Saliency filters: Contrast based filtering for salient region detection , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[9]  Robert C. Bolles,et al.  Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography , 1981, CACM.

[10]  David G. Lowe,et al.  Object recognition from local scale-invariant features , 1999, Proceedings of the Seventh IEEE International Conference on Computer Vision.

[11]  Liming Zhang,et al.  Spatio-temporal Saliency detection using phase spectrum of quaternion fourier transform , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[12]  Yu Hen Hu,et al.  Discovering panoramas in web videos , 2008, ACM Multimedia.

[13]  Steven M. Drucker,et al.  Cliplets: juxtaposing still and dynamic imagery , 2012, UIST.

[14]  Xin Liu,et al.  Video summarization and retrieval using singular value decomposition , 2003, Multimedia Systems.

[15]  Chong-Wah Ngo,et al.  Video summarization and scene detection by graph modeling , 2005, IEEE Transactions on Circuits and Systems for Video Technology.

[16]  P. Anandan,et al.  Mosaic based representations of video sequences and their applications , 1995, Proceedings of IEEE International Conference on Computer Vision.

[17]  Yong Jae Lee,et al.  Discovering important people and objects for egocentric video summarization , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[18]  Yael Pritch,et al.  Webcam Synopsis: Peeking Around the World , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[19]  Peyman Milanfar,et al.  Static and space-time visual saliency detection by self-resemblance. , 2009, Journal of vision.

[20]  James M. Rehg,et al.  Learning to Predict Gaze in Egocentric Video , 2013, 2013 IEEE International Conference on Computer Vision.

[21]  Somnath Sengupta,et al.  Event-Importance Based Customized and Automatic Cricket Highlight Generation , 2006, 2006 IEEE International Conference on Multimedia and Expo.

[22]  Chih-Jen Lin,et al.  Large-Scale Video Summarization Using Web-Image Priors , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[23]  David Salesin,et al.  Interactive digital photomontage , 2004, SIGGRAPH 2004.

[24]  Jitendra Malik,et al.  Poselets: Body part detectors trained using 3D human pose annotations , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[25]  Qingshan Liu,et al.  Temporal spectral residual: fast motion saliency detection , 2009, ACM Multimedia.

[26]  Alan Fern,et al.  Multi-object Tracking via Constrained Sequential Labeling , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[27]  Hanspeter Pfister,et al.  Video Snapshots: Creating High-Quality Images from Video Clips , 2012, IEEE Transactions on Visualization and Computer Graphics.

[28]  Alan Hanjalic,et al.  Adaptive extraction of highlights from a sport video based on excitement modeling , 2005, IEEE Transactions on Multimedia.

[29]  Ramakant Nevatia,et al.  An online learned CRF model for multi-target tracking , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[30]  Konrad Schindler,et al.  Detection- and Trajectory-Level Exclusion in Multiple Object Tracking , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[31]  Katerina Fragkiadaki,et al.  Two-Granularity Tracking: Mediating Trajectory and Detection Graphs for Tracking under Occlusions , 2012, ECCV.

[32]  Bill Triggs,et al.  Histograms of oriented gradients for human detection , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[33]  Dana H. Ballard,et al.  Generalizing the Hough transform to detect arbitrary shapes , 1981, Pattern Recognit..

[34]  Kristen Grauman,et al.  Diverse Sequential Subset Selection for Supervised Video Summarization , 2014, NIPS.

[35]  Kristen Grauman,et al.  Story-Driven Summarization for Egocentric Video , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[36]  Bin Zhao,et al.  Quasi Real-Time Summarization for Consumer Videos , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[37]  David Salesin,et al.  Schematic storyboarding for video visualization and editing , 2006, SIGGRAPH 2006.

[38]  Chng Eng Siong,et al.  Sports highlight detection from keyword sequences using HMM , 2004, 2004 IEEE International Conference on Multimedia and Expo (ICME) (IEEE Cat. No.04TH8763).

[39]  Zdenek Kalal,et al.  Tracking-Learning-Detection , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[40]  James M. Rehg,et al.  Learning to Recognize Daily Actions Using Gaze , 2012, ECCV.

[41]  Lihi Zelnik-Manor,et al.  Context-Aware Saliency Detection , 2012, IEEE Trans. Pattern Anal. Mach. Intell..

[42]  Xin Liu,et al.  Video summarization using singular value decomposition , 2000, Proceedings IEEE Conference on Computer Vision and Pattern Recognition. CVPR 2000 (Cat. No.PR00662).

[43]  Regunathan Radhakrishnan,et al.  Highlights extraction from sports video based on an audio-visual marker detection framework , 2005, 2005 IEEE International Conference on Multimedia and Expo.

[44]  Cristian Sminchisescu,et al.  Actions in the Eye: Dynamic Gaze Datasets and Learnt Saliency Models for Visual Recognition , 2013, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[45]  John R. Kender,et al.  Video Summaries through Mosaic-Based Shot and Scene Clustering , 2002, ECCV.

[46]  Roberto Turrin,et al.  Performance of recommender algorithms on top-n recommendation tasks , 2010, RecSys '10.

[47]  Lihi Zelnik-Manor,et al.  Learning Video Saliency from Human Gaze Using Candidate Selection , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[48]  Takeo Kanade,et al.  An Iterative Image Registration Technique with an Application to Stereo Vision , 1981, IJCAI.

[49]  Ali Farhadi,et al.  Ranking Domain-Specific Highlights by Analyzing Edited Videos , 2014, ECCV.

[50]  Walter Bender,et al.  Salient Stills: Process and Practice , 1996, IBM Syst. J..

[51]  John M. Henderson,et al.  Clustering of Gaze During Dynamic Scene Viewing is Predicted by Motion , 2011, Cognitive Computation.

[52]  Surya Nepal,et al.  Automatic detection of 'Goal' segments in basketball videos , 2001, MULTIMEDIA '01.

[53]  Gang Hua,et al.  A Hierarchical Visual Model for Video Object Summarization , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[54]  Richard Szeliski,et al.  First-person hyper-lapse videos , 2014, ACM Trans. Graph..

[55]  Hao Tang,et al.  Detecting highlights in sports videos: Cricket as a test case , 2011, 2011 IEEE International Conference on Multimedia and Expo.

[56]  Frédo Durand,et al.  Learning to predict where humans look , 2009, 2009 IEEE 12th International Conference on Computer Vision.