Video Summarization Using Deep Neural Networks: A Survey

Video summarization technologies aim to create a concise and complete synopsis by selecting the most informative parts of the video content. Several approaches have been developed over the last couple of decades, and the current state of the art is represented by methods that rely on modern deep neural network architectures. This work focuses on the recent advances in the area and provides a comprehensive survey of the existing deep-learning-based methods for generic video summarization. After presenting the motivation behind the development of technologies for video summarization, we formulate the video summarization task and discuss the main characteristics of a typical deep-learning-based analysis pipeline. Then, we suggest a taxonomy of the existing algorithms and provide a systematic review of the relevant literature that shows the evolution of the deep-learning-based video summarization technologies and leads to suggestions for future developments. Next, we report on protocols for the objective evaluation of video summarization algorithms and compare the performance of several deep-learning-based approaches. Based on the outcomes of these comparisons, as well as some documented considerations about the suitability of evaluation protocols, we indicate potential future research directions.
