Contrastive Learning for Unsupervised Video Highlight Detection

Video highlight detection can greatly simplify video browsing, potentially paving the way for a wide range of applications. Existing efforts are mostly fully supervised, requiring humans to manually identify and label the interesting moments (called highlights) in a video. Recent weakly supervised methods forgo highlight annotations, but typically require extensive effort to collect external data, such as web-crawled videos, for model learning. This observation has inspired us to consider unsupervised highlight detection, where neither frame-level nor video-level annotations are available in training. We propose a simple contrastive learning framework for unsupervised highlight detection. Our framework encodes a video into a vector representation by learning to pick the video clips that help distinguish it from other videos, via a contrastive objective that uses dropout noise. This inherently allows our framework to identify the video clips corresponding to the highlights of the video. Extensive empirical evaluations on three highlight detection benchmarks demonstrate the superior performance of our approach.
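The idea of a contrastive objective with dropout noise can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify the architecture, so the softmax clip-scoring pooler, the InfoNCE-style loss, the temperature, and all names (`encode`, `info_nce`, `w`, `p_drop`) are assumptions made for illustration. Each video is encoded twice under different dropout masks; the two views of the same video form a positive pair, and the other videos in the batch act as negatives.

```python
import numpy as np

def encode(clips, w, rng, p_drop=0.1):
    """Pool clip features (T, D) into one video vector via per-clip scores.

    Hypothetical encoder: a single linear scorer `w` followed by a softmax.
    """
    # Dropout noise: a fresh random mask per call yields a different "view".
    mask = rng.random(clips.shape) >= p_drop
    noisy = clips * mask / (1.0 - p_drop)
    logits = noisy @ w                       # (T,) one score per clip
    scores = np.exp(logits - logits.max())
    scores /= scores.sum()                   # softmax over clips
    return scores @ noisy, scores            # video vector (D,), clip scores (T,)

def info_nce(z1, z2, temperature=0.1):
    """InfoNCE-style loss: row i of z1 should match row i of z2."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    sim = z1 @ z2.T / temperature            # (B, B) scaled cosine similarities
    sim = sim - sim.max(axis=1, keepdims=True)
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))       # positives lie on the diagonal

rng = np.random.default_rng(0)
videos = [rng.normal(size=(8, 16)) for _ in range(4)]  # 4 videos, 8 clips, 16-d features
w = rng.normal(size=16) * 0.1                          # untrained scorer weights

view_a = np.stack([encode(v, w, rng)[0] for v in videos])
view_b = np.stack([encode(v, w, rng)[0] for v in videos])
loss = info_nce(view_a, view_b)

# With dropout disabled, the clip scores can be read as highlight scores.
_, clip_scores = encode(videos[0], w, rng, p_drop=0.0)
```

In this sketch the scorer is untrained, so `clip_scores` is not meaningful yet; under the stated assumptions, minimizing `loss` with respect to `w` would push the pooler to weight the clips that best separate a video from the others, and those weights would then serve as highlight scores.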
