Learning to Summarize Videos by Contrasting Clips

Video summarization aims to select the parts of a video that tell a story as close as possible to the original. Most existing video summarization approaches rely on hand-crafted labels. As the number of videos grows exponentially, there is an increasing need for methods that can learn meaningful summarizations without labeled annotations. In this paper, we aim to maximally exploit unsupervised video summarization while concentrating the supervision on a few personalized labels as an add-on. To do so, we formulate the key requirements for an informative video summary. Then, we propose contrastive learning as the answer to these requirements. To further boost Contrastive video Summarization (CSUM), we propose to contrast top-k features instead of the mean video feature employed by the existing method, which we implement with a differentiable top-k feature selector. Our experiments on several benchmarks demonstrate that our approach produces meaningful and diverse summaries when no labeled data is provided.
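To make the idea of contrasting top-k features rather than a mean video feature more concrete, the following is a minimal NumPy sketch and not the implementation evaluated in the paper. It assumes per-clip features and per-clip importance scores are already available. The sigmoid-threshold relaxation in `soft_top_k_weights` is a simple stand-in for a differentiable top-k selector, and the InfoNCE-style loss in `info_nce` is one common contrastive objective. All function names, the temperatures, and the pooling step are illustrative assumptions.

```python
import numpy as np


def soft_top_k_weights(scores, k, temperature=0.05):
    """Relaxed top-k mask (assumed stand-in for a differentiable selector).

    Weights approach 1 for the k highest scores and 0 for the rest.
    Requires 1 <= k < len(scores).
    """
    ordered = np.sort(scores)[::-1]
    # Threshold halfway between the k-th and (k+1)-th largest score.
    threshold = 0.5 * (ordered[k - 1] + ordered[k])
    return 1.0 / (1.0 + np.exp(-(scores - threshold) / temperature))


def pool_top_k(features, scores, k, temperature=0.05):
    """Weighted average of clip features, dominated by the top-k clips."""
    w = soft_top_k_weights(scores, k, temperature)
    return (w[:, None] * features).sum(axis=0) / w.sum()


def info_nce(anchor, positive, negatives, tau=0.1):
    """InfoNCE-style loss on cosine similarities (illustrative choice)."""
    def unit(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    a, p, n = unit(anchor), unit(positive), unit(negatives)
    logits = np.concatenate(([a @ p], n @ a)) / tau
    logits = logits - logits.max()  # numerical stability
    return -(logits[0] - np.log(np.exp(logits).sum()))


# Hypothetical data: 8 clips with 16-dimensional features.
rng = np.random.default_rng(0)
features = rng.normal(size=(8, 16))
scores = rng.uniform(size=8)

top_k_feature = pool_top_k(features, scores, k=3)  # emphasizes 3 clips
mean_feature = features.mean(axis=0)               # mean-pooling baseline
```

In this sketch, `top_k_feature` would replace `mean_feature` as the video-level representation passed to the contrastive loss. Lowering `temperature` moves the relaxed mask closer to a hard top-k selection at the cost of sharper gradients.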
