Action Parsing-Driven Video Summarization Based on Reinforcement Learning

Managing, storing, and indexing large numbers of videos is an urgent problem. Although many video summarization models achieve good results, models based on low-level features cannot capture important semantic information, and models based on semantic analysis need related text descriptions that do not exist for most videos. Consequently, mining the semantic information contained in the video itself is a more feasible approach. In this paper, we propose an action parsing-driven video summarization model based on reinforcement learning. The model is divided into two parts: cutting the video by action parsing, and summarizing the video with reinforcement learning. In the first part, a sequential multiple instance learning model is trained with weakly annotated data, which avoids the time cost of full annotation while reducing the ambiguity of weak annotation. In the second part, we design a deep recurrent neural network-based video summarization model that selects the frames that best distinguish an action from other actions. The quality of the extracted key frames can then be evaluated by categorization accuracy. Experiments and comparisons with state-of-the-art methods demonstrate the advantage of the proposed approach.
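
The idea of rewarding a frame selection by its categorization accuracy can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not the model described above: the policy is a table of per-frame logits rather than a deep recurrent network, the classifier is a nearest-centroid rule on scalar toy features, and the update is a plain REINFORCE-style policy gradient without a baseline. The names `classify`, `reinforce_step`, and `train` are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def classify(frames, selected, centroids):
    """Nearest-centroid label for the mean feature of the selected frames."""
    picked = [frames[i] for i in selected]
    if not picked:
        return None  # an empty summary cannot be classified
    mean = sum(picked) / len(picked)
    # Ties go to the first label in the dict's insertion order.
    return min(centroids, key=lambda label: abs(centroids[label] - mean))


def reinforce_step(logits, frames, label, centroids, lr, rng):
    """Sample a frame selection, score it, and update the logits once."""
    probs = [sigmoid(z) for z in logits]
    actions = [1 if rng.random() < p else 0 for p in probs]
    selected = [i for i, a in enumerate(actions) if a]
    # Reward is 1 when the selected frames alone yield the correct category.
    reward = 1.0 if classify(frames, selected, centroids) == label else 0.0
    # d/dz log Bernoulli(a; sigmoid(z)) = a - sigmoid(z)
    new_logits = [z + lr * reward * (a - p)
                  for z, a, p in zip(logits, actions, probs)]
    return new_logits, reward


def train(frames, label, centroids, steps=3000, lr=0.5, seed=0):
    """Run repeated updates from uniform logits (assumed toy setup)."""
    rng = random.Random(seed)
    logits = [0.0] * len(frames)
    for _ in range(steps):
        logits, _ = reinforce_step(logits, frames, label, centroids, lr, rng)
    return logits


if __name__ == "__main__":
    # Toy video: frames 1 and 3 look like "run", the others like "walk".
    toy_frames = [0.0, 1.0, 0.0, 1.0, 0.0]
    toy_centroids = {"run": 1.0, "walk": 0.0}
    learned = train(toy_frames, "run", toy_centroids)
    print([round(sigmoid(z), 2) for z in learned])
```

In this toy setting the selection probabilities of the discriminative frames tend to rise while those of the uninformative frames tend to fall, because only selections that still classify correctly receive a reward. A learned baseline or a summary-length penalty would be natural refinements, but they are omitted to keep the sketch short.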
