Deep Attentive Video Summarization With Distribution Consistency Learning

This article studies supervised video summarization by formulating it into a sequence-to-sequence learning framework, in which the input and output are sequences of original video frames and their predicted importance scores, respectively. Two critical issues are addressed in this article: short-term contextual attention insufficiency and distribution inconsistency. The former lies in the insufficiency of capturing the short-term contextual attention information within the video sequence itself since the existing approaches focus a lot on the long-term encoder-decoder attention. The latter refers to the distributions of predicted importance score sequence and the ground-truth sequence is inconsistent, which may lead to a suboptimal solution. To better mitigate the first issue, we incorporate a self-attention mechanism in the encoder to highlight the important keyframes in a short-term context. The proposed approach alongside the encoder-decoder attention constitutes our deep attentive models for video summarization. For the second one, we propose a distribution consistency learning method by employing a simple yet effective regularization loss term, which seeks a consistent distribution for the two sequences. Our final approach is dubbed as Attentive and Distribution consistent video Summarization (ADSum). Extensive experiments on benchmark data sets demonstrate the superiority of the proposed ADSum approach against state-of-the-art approaches.

[1]  Bin Zhao,et al.  HSA-RNN: Hierarchical Structure-Adaptive RNN for Video Summarization , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[2]  Ke Zhang,et al.  Summary Transfer: Exemplar-Based Subset Selection for Video Summarization , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[3]  Zhenfeng Zhu,et al.  Seeing All From a Few: $\ell_{1}$ -Norm-Induced Discriminative Prototype Selection , 2019, IEEE Transactions on Neural Networks and Learning Systems.

[4]  Wayne H. Wolf,et al.  Key frame selection by motion analysis , 1996, 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings.

[5]  Xuelong Li,et al.  Video Summarization With Attention-Based Encoder–Decoder Networks , 2017, IEEE Transactions on Circuits and Systems for Video Technology.

[6]  Ke Zhang,et al.  Retrospective Encoders for Video Summarization , 2018, ECCV.

[7]  Yale Song,et al.  TVSum: Summarizing web videos using titles , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[8]  Trevor Darrell,et al.  Sequence to Sequence -- Video to Text , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[9]  Meng Wang,et al.  3-D PersonVLAD: Learning Deep Global Representations for Video-Based Person Reidentification , 2018, IEEE Transactions on Neural Networks and Learning Systems.

[10]  Kaiyang Zhou,et al.  Deep Reinforcement Learning for Unsupervised Video Summarization with Diversity-Representativeness Reward , 2017, AAAI.

[11]  Michael S. Bernstein,et al.  ImageNet Large Scale Visual Recognition Challenge , 2014, International Journal of Computer Vision.

[12]  Xinbo Gao,et al.  Unsupervised video-shot segmentation and model-free anchorperson detection for news video story parsing , 2002, IEEE Trans. Circuits Syst. Video Technol..

[13]  Shih-Ping Liou,et al.  Videoabstract: a hybrid approach to generate semantically meaningful video summaries , 2000, 2000 IEEE International Conference on Multimedia and Expo. ICME2000. Proceedings. Latest Advances in the Fast Changing World of Multimedia (Cat. No.00TH8532).

[14]  Ying Li,et al.  An Overview of Video Abstraction Techniques , 2001 .

[15]  Kristen Grauman,et al.  Diverse Sequential Subset Selection for Supervised Video Summarization , 2014, NIPS.

[16]  Tianbao Yang,et al.  How Local is the Local Diversity? Reinforcing Sequential Determinantal Point Processes with Dynamic Ground Sets for Supervised Video Summarization , 2018, ECCV.

[17]  Xuelong Li,et al.  Query-aware sparse coding for web multi-video summarization , 2019, Inf. Sci..

[18]  Kuldip K. Paliwal,et al.  Bidirectional recurrent neural networks , 1997, IEEE Trans. Signal Process..

[19]  Michael Lam,et al.  Unsupervised Video Summarization with Adversarial LSTM Networks , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[20]  Bingbing Ni,et al.  Video Summarization via Semantic Attended Networks , 2018, AAAI.

[21]  Ke Zhang,et al.  Video Summarization with Long Short-Term Memory , 2016, ECCV.

[22]  Hwann-Tzong Chen,et al.  Attentive and Adversarial Learning for Video Summarization , 2019, 2019 IEEE Winter Conference on Applications of Computer Vision (WACV).

[23]  Dumitru Erhan,et al.  Going deeper with convolutions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[24]  Xuelong Li,et al.  Hierarchical Recurrent Neural Network for Video Summarization , 2017, ACM Multimedia.

[25]  Shaohui Mei,et al.  Video summarization via minimum sparse reconstruction , 2015, Pattern Recognit..

[26]  Jiebo Luo,et al.  Towards Scalable Summarization of Consumer Videos Via Sparse Dictionary Selection , 2012, IEEE Transactions on Multimedia.

[27]  Xuelong Li,et al.  A General Framework for Edited Video and Raw Video Summarization , 2017, IEEE Transactions on Image Processing.

[28]  Yelena Yesha,et al.  Keyframe-based video summarization using Delaunay clustering , 2006, International Journal on Digital Libraries.

[29]  Arnaldo de Albuquerque Araújo,et al.  VSUMM: A mechanism designed to produce static video summaries and a novel evaluation method , 2011, Pattern Recognit. Lett..

[30]  Christopher D. Manning,et al.  Effective Approaches to Attention-based Neural Machine Translation , 2015, EMNLP.

[31]  Luc Van Gool,et al.  Creating Summaries from User Videos , 2014, ECCV.

[32]  Xuelong Li,et al.  Hypergraph dominant set based multi-video summarization , 2018, Signal Process..

[33]  Yue Wang,et al.  Motion-State-Adaptive Video Summarization via Spatiotemporal Analysis , 2017, IEEE Transactions on Circuits and Systems for Video Technology.

[34]  Chong-Wah Ngo,et al.  Automatic video summarization by graph modeling , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[35]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[36]  Bin Zhao,et al.  Quasi Real-Time Summarization for Consumer Videos , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[37]  Luc Van Gool,et al.  Video summarization by learning submodular mixtures of objectives , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[38]  Tao Mei,et al.  Highlight Detection with Pairwise Deep Ranking for First-Person Video Summarization , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[39]  Yoshua Bengio,et al.  Neural Machine Translation by Jointly Learning to Align and Translate , 2014, ICLR.

[40]  In-So Kweon,et al.  Discriminative Feature Learning for Unsupervised Video Summarization , 2018, AAAI.