LSTM-based multi-label video event detection

Large-scale surveillance videos typically contain complex visual events, so generating video descriptions effectively and efficiently without human supervision has become essential. To address this problem, we propose a novel architecture, motivated by the sequence-to-sequence network, for jointly recognizing multiple events in a given surveillance video. The proposed architecture predicts what happens in a video directly, without object detection and tracking as preprocessing steps. We evaluate several variants of the architecture with different visual features on a new dataset prepared by our group, using a wide range of quantitative metrics. We further compare it with a popular Support Vector Machine-based visual event detection method. The comparison results suggest that the proposed method can outperform traditional computer vision pipelines for visual event detection.
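
The abstract does not name the quantitative metrics used. As an illustration only, the sketch below shows how predictions for a multi-label event detector could be scored, assuming micro-averaged precision, recall, F1, and Hamming loss over per-video label sets. The function name `multilabel_metrics` and the event labels are hypothetical and do not come from the paper.

```python
from typing import Dict, List, Set


def multilabel_metrics(
    y_true: List[Set[str]],
    y_pred: List[Set[str]],
    labels: Set[str],
) -> Dict[str, float]:
    """Micro-averaged precision, recall, F1 and Hamming loss.

    Each element of y_true / y_pred is the set of event labels for one video.
    The choice of metrics is an assumption, not taken from the paper.
    """
    tp = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        tp += len(truth & pred)   # events correctly predicted
        fp += len(pred - truth)   # events predicted but absent
        fn += len(truth - pred)   # events present but missed

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    # Hamming loss: fraction of (video, label) decisions that are wrong.
    total = len(y_true) * len(labels)
    hamming = (fp + fn) / total if total else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "hamming_loss": hamming,
    }


# Hypothetical event vocabulary and two example videos.
labels = {"walking", "running", "gathering"}
y_true = [{"walking", "gathering"}, {"running"}]
y_pred = [{"walking"}, {"running", "gathering"}]

metrics = multilabel_metrics(y_true, y_pred, labels)
print(metrics)
```

Micro-averaging pools the counts across all videos before computing the ratios, so frequent events weigh more than rare ones; a macro-averaged variant would instead compute the scores per label and average them.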
