Property-Constrained Dual Learning for Video Summarization

Video summarization condenses large-scale videos into summaries composed of key-frames or key-shots so that viewers can browse the video content efficiently. Recently, supervised approaches have achieved great success by taking advantage of recurrent neural networks (RNNs). Most of them generate summaries by maximizing the overlap between the generated summary and the ground truth. However, they neglect the most critical principle, i.e., whether the viewer can infer the original video content from the summary. As a result, existing approaches cannot preserve summary quality well and usually demand large amounts of training data to reduce overfitting. In our view, video summarization comprises two tasks: generating summaries from videos and inferring the original content from summaries. Motivated by this, we propose a dual learning framework that integrates summary generation (the primal task) with video reconstruction (the dual task), so that the video reconstructor helps reward the summary generator. Moreover, to provide more guidance to the summary generator, two property models are developed to measure the representativeness and diversity of the generated summary. Experiments on four popular data sets (SumMe, TVSum, OVP, and YouTube) demonstrate that our approach, with compact RNNs as the summary generator, using less training data, and even in the unsupervised setting, achieves performance comparable to that of supervised approaches that adopt more complex summary generators and are trained on more annotated data.
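
The abstract does not specify how the two property models are defined. As a rough illustration only, the sketch below scores a candidate summary with one formulation that is common in the unsupervised summarization literature: diversity as the mean pairwise cosine dissimilarity among selected frames, and representativeness as the exponential of the negative mean distance from every frame to its nearest selected frame. The function names, the toy feature vectors, and the choice of these particular formulas are assumptions made for illustration, not the method of this paper.

```python
import math


def cosine_dissimilarity(a, b):
    """Return 1 - cosine similarity of two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: treat a zero vector as maximally dissimilar.
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def diversity_score(features, selected):
    """Mean pairwise cosine dissimilarity among the selected frames."""
    n = len(selected)
    if n < 2:
        return 0.0
    total = 0.0
    for i in selected:
        for j in selected:
            if i != j:
                total += cosine_dissimilarity(features[i], features[j])
    return total / (n * (n - 1))


def representativeness_score(features, selected):
    """exp(-mean distance from each frame to its nearest selected frame)."""
    if not selected:
        return 0.0
    total = 0.0
    for frame in features:
        total += min(math.dist(frame, features[s]) for s in selected)
    return math.exp(-total / len(features))


# Toy video (hypothetical): two near-duplicate "scenes" of two frames each.
features = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
good_summary = [0, 2]   # one frame from each scene
poor_summary = [0, 1]   # two frames from the same scene

print(diversity_score(features, good_summary),
      representativeness_score(features, good_summary))
print(diversity_score(features, poor_summary),
      representativeness_score(features, poor_summary))
```

On this toy input, the summary that covers both scenes scores higher on both measures than the one that picks two frames from the same scene, which is the kind of signal a property model could feed back to the summary generator.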
