Video Summarization Using Deep Neural Networks: A Survey

Video summarization technologies aim to create a concise and complete synopsis by selecting the most informative parts of the video content. Several approaches have been developed over the last couple of decades, and the current state of the art is represented by methods that rely on modern deep neural network architectures. This work focuses on the recent advances in the area and provides a comprehensive survey of the existing deep-learning-based methods for generic video summarization. After presenting the motivation behind the development of technologies for video summarization, we formulate the video summarization task and discuss the main characteristics of a typical deep-learning-based analysis pipeline. Then, we suggest a taxonomy of the existing algorithms and provide a systematic review of the relevant literature that shows the evolution of the deep-learning-based video summarization technologies and leads to suggestions for future developments. We then report on protocols for the objective evaluation of video summarization algorithms, and we compare the performance of several deep-learning-based approaches. Based on the outcomes of these comparisons, as well as some documented considerations about the amount of annotated data and the suitability of evaluation protocols, we indicate potential future research directions.
