Extractive Video Summarizer with Memory Augmented Neural Networks

Online video content has grown explosively in recent years, so helping users browse videos efficiently is increasingly important. Video summarization automatically shortens a video by extracting key shots from the raw footage, which makes video data easier to digest. State-of-the-art supervised video summarization algorithms learn directly from manually created summaries to mimic the key-frame/key-shot selection criteria of humans. Humans usually create a summary after viewing and understanding the whole video, so a global attention mechanism that captures information from all video frames plays a key role in the summarization process. However, previous supervised approaches either ignored temporal relations or modeled only local inter-dependencies across frames. Motivated by this observation, we propose a memory augmented extractive video summarizer, which uses a high-capacity external memory to record visual information from the whole video. With the external memory, the summarizer predicts the importance score of a video shot based on a global understanding of the video frames. The proposed method outperforms previous state-of-the-art algorithms on the public SumMe and TVSum datasets. More importantly, we demonstrate that global attention modeling has two advantages: good transferability across datasets and high robustness to noisy videos.
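The idea of scoring a shot against a memory of the whole video can be illustrated with a minimal sketch. This is not the paper's architecture: it assumes a single soft-attention read over a memory that stores one feature vector per frame, a shot query formed by averaging the shot's frame features, and a hypothetical, untrained linear scorer (`w`, `b`). All function names and the toy features are illustrative assumptions.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mean_vector(vectors):
    # Element-wise mean of equally sized vectors.
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def memory_read(query, memory):
    # One attention hop: weight every memory slot (frame) by its
    # similarity to the query, then return the weighted sum.
    weights = softmax([dot(query, slot) for slot in memory])
    read = [sum(w * slot[d] for w, slot in zip(weights, memory))
            for d in range(len(query))]
    return read, weights

def shot_importance(shot_frames, memory, w, b):
    # Assumption: the shot query is the mean of its frame features, and
    # the score is a sigmoid over a linear map of (query + memory read).
    query = mean_vector(shot_frames)
    read, _ = memory_read(query, memory)
    fused = [q + r for q, r in zip(query, read)]
    return 1.0 / (1.0 + math.exp(-(dot(w, fused) + b)))

# Toy example: four 2-D frame features forming two shots.
frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
shots = [frames[0:2], frames[2:4]]
w, b = [1.0, -1.0], 0.0  # hypothetical scorer parameters, not learned

# Every shot is scored against the memory of ALL frames, not just its own.
scores = [shot_importance(shot, frames, w, b) for shot in shots]
```

Because each shot attends over every frame in the video, its score depends on global context rather than only on neighbouring frames. In a trained model the scorer parameters and the feature embeddings would be learned from human-created summaries.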
