Less Is More: Learning Highlight Detection From Video Duration

Highlight detection has the potential to significantly ease video browsing, but existing methods often require expensive supervision, in which human viewers must manually identify highlights in training videos. We propose a scalable unsupervised solution that exploits video duration as an implicit supervision signal. Our key insight is that video segments from shorter user-generated videos are more likely to be highlights than those from longer videos, since users tend to be more selective about the content when capturing shorter videos. Leveraging this insight, we introduce a novel ranking framework that prefers segments from shorter videos, while properly accounting for the inherent noise in the (unlabeled) training data. We use it to train a highlight detector on 10M hashtagged Instagram videos. In experiments on two challenging public video highlight detection benchmarks, our method substantially improves the state of the art for unsupervised highlight detection.
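To make the ranking idea concrete, the following is a minimal sketch of a pairwise hinge ranking loss that prefers segments from shorter videos over segments from longer ones. It is an illustration under stated assumptions, not the loss used in the paper: the function name `pairwise_hinge_loss`, the `margin` parameter, and the optional per-pair `weights` (a simple stand-in for down-weighting pairs suspected to be noisy) are all hypothetical.

```python
def pairwise_hinge_loss(short_scores, long_scores, margin=1.0, weights=None):
    """Weighted mean hinge loss over (short-video, long-video) segment pairs.

    short_scores[i] and long_scores[i] are predicted highlight scores for a
    segment from a shorter video and a segment from a longer video. A pair
    incurs no loss once the short-video score exceeds the long-video score
    by at least `margin`.

    `weights` is an assumption of this sketch: a per-pair confidence that
    lets unreliable pairs contribute less to the loss.
    """
    if len(short_scores) != len(long_scores):
        raise ValueError("score lists must have the same length")
    if weights is None:
        weights = [1.0] * len(short_scores)
    if len(weights) != len(short_scores):
        raise ValueError("weights must match the number of pairs")
    weight_sum = sum(weights)
    if weight_sum <= 0:
        raise ValueError("weights must sum to a positive value")

    total = 0.0
    for s, l, w in zip(short_scores, long_scores, weights):
        # Penalize the pair when the short-video segment is not ranked
        # above the long-video segment by the required margin.
        total += w * max(0.0, margin - (s - l))
    return total / weight_sum


# Two pairs: the first is ranked correctly, the second is not.
loss_all = pairwise_hinge_loss([2.0, 0.5], [0.0, 1.0])
# Same pairs, with the second pair treated as noise and ignored.
loss_filtered = pairwise_hinge_loss([2.0, 0.5], [0.0, 1.0], weights=[1.0, 0.0])
```

In this sketch, a zero weight removes a pair from the loss entirely; a learned or estimated weight would be a softer way to account for pairs in which the short-video segment is not actually the better highlight.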
