Crowd aware summarization of surveillance videos by deep reinforcement learning

Surveillance videos that record crowd behavior have increased dramatically with the wide deployment of surveillance applications. Because such videos typically contain a large number of redundant frames, there is growing demand for a way to review them quickly within limited time. In this paper, we focus on the summarization of crowd surveillance videos. The task is difficult for two reasons. First, the model must decide whether to keep or discard each subshot of the input surveillance video stream so that the summary outlines the main behaviors of the crowd within a limited number of frames. Second, the summarization model must maintain its performance on long surveillance videos. To tackle these challenges, we formulate surveillance video summarization as a sequential decision-making process and train the summarization network within a reinforcement learning framework. A novel crowd location-density reward is proposed to teach the summarization network to produce high-quality summaries. In addition, a summarization network with a three-layer LSTM is designed to maintain performance across longer time spans. Extensive experiments on three public crowd surveillance video datasets show that the proposed method achieves state-of-the-art performance.
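
As an illustration of the sequential decision-making formulation, the following minimal sketch trains a keep/discard policy over subshots with a REINFORCE-style update. It rests on several assumptions that are not taken from the paper: a linear logistic policy stands in for the three-layer LSTM, `toy_reward` is a hypothetical placeholder rather than the proposed crowd location-density reward (whose definition the abstract does not give), and the features and per-subshot density scores are synthetic.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def sample_actions(probs, rng):
    """Sample keep (1) / discard (0) for each subshot from Bernoulli(probs)."""
    return (rng.random(probs.shape) < probs).astype(np.float64)


def toy_reward(actions, density, budget=0.3, penalty=1.0):
    """Hypothetical reward (an assumption, not the paper's reward).

    Rewards the share of total crowd density covered by the kept subshots
    and penalizes keeping more than `budget` of the video.
    """
    if actions.sum() == 0:
        return 0.0
    coverage = float((actions * density).sum() / density.sum())
    over_budget = max(0.0, float(actions.mean()) - budget)
    return coverage - penalty * over_budget


def reinforce_step(w, feats, density, rng, lr=0.1, episodes=8):
    """One policy-gradient update using a mean-reward baseline."""
    probs = sigmoid(feats @ w)
    rewards, grads = [], []
    for _ in range(episodes):
        actions = sample_actions(probs, rng)
        rewards.append(toy_reward(actions, density))
        # Gradient of the Bernoulli log-likelihood w.r.t. w for a logistic policy.
        grads.append(feats.T @ (actions - probs))
    rewards = np.array(rewards)
    baseline = rewards.mean()
    grad = sum((r - baseline) * g for r, g in zip(rewards, grads)) / episodes
    return w + lr * grad, float(baseline)


# Synthetic data: 40 subshots, 4-dimensional features, positive density scores.
rng = np.random.default_rng(0)
num_subshots, feat_dim = 40, 4
feats = rng.normal(size=(num_subshots, feat_dim))
density = rng.random(num_subshots) + 0.1

w = np.zeros(feat_dim)
for _ in range(200):
    w, avg_reward = reinforce_step(w, feats, density, rng)

keep_probs = sigmoid(feats @ w)  # per-subshot probability of being kept
```

In a fuller implementation, the logistic policy would be replaced by a recurrent network that reads the subshot features in order, so that each keep/discard decision can depend on earlier context.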
