LSTM-based multi-label video event detection

Since large-scale surveillance videos always contain complex visual events, how to generate video descriptions effectively and efficiently without human supervision has become mandatory. To address this problem, we propose a novel architecture for jointly recognizing multiple events in a given surveillance video, motivated by the sequence to sequence network. The proposed architecture can predict what happens in a video directly without the preprocessing of object detection and tracking. We evaluate several variants of the proposed architecture with different visual features on a novel dataset perpared by our group. Moreover, we compute a wide range of quantitative metrics to evaluate this architecture. We further compare it to the popular Support Vector Machine-based visual event detection method. The comparison results suggest that the proposal method can outperform the traditional computer vision pipelines for visual event detection.

[1]  Dar-Shyang Lee,et al.  Effective Gaussian mixture learning for video background subtraction , 2005, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[2]  Kate Saenko,et al.  Improving LSTM-based Video Description with Linguistic Knowledge Mined from Text , 2016, EMNLP.

[3]  Takeo Kanade,et al.  Introduction to the Special Section on Video Surveillance , 2000, IEEE Trans. Pattern Anal. Mach. Intell..

[4]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[5]  Limin Wang,et al.  Computer Vision and Image Understanding Bag of Visual Words and Fusion Methods for Action Recognition: Comprehensive Study and Good Practice , 2022 .

[6]  Fei-Fei Li,et al.  Large-Scale Video Classification with Convolutional Neural Networks , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[7]  Ming Yang,et al.  3D Convolutional Neural Networks for Human Action Recognition , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[8]  Yangqing Jia,et al.  Deep Convolutional Ranking for Multilabel Image Annotation , 2013, ICLR.

[9]  Greg Mori,et al.  A Hierarchical Deep Temporal Model for Group Activity Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[10]  Chih-Jen Lin,et al.  LIBSVM: A library for support vector machines , 2011, TIST.

[11]  Yael Pritch,et al.  Clustered Synopsis of Surveillance Video , 2009, 2009 Sixth IEEE International Conference on Advanced Video and Signal Based Surveillance.

[12]  Yi Li,et al.  R-FCN: Object Detection via Region-based Fully Convolutional Networks , 2016, NIPS.

[13]  Li Fei-Fei,et al.  VideoSET: Video Summary Evaluation through Text , 2014, ArXiv.

[14]  Yoshua Bengio,et al.  Neural Machine Translation by Jointly Learning to Align and Translate , 2014, ICLR.

[15]  Stan Sclaroff,et al.  Do less and achieve more: Training CNNs for action recognition utilizing action images from the Web , 2015, Pattern Recognit..

[16]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[17]  David J. Crandall,et al.  DeepDiary: Automatically Captioning Lifelogging Image Streams , 2016, ECCV Workshops.

[18]  Tieniu Tan,et al.  Recent developments in human motion analysis , 2003, Pattern Recognit..

[19]  Yang Wang,et al.  Discriminative Latent Models for Recognizing Contextual Group Activities , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[20]  Yoshua Bengio,et al.  On the Properties of Neural Machine Translation: Encoder–Decoder Approaches , 2014, SSST@EMNLP.

[21]  Zan Gao,et al.  Multi-view discriminative and structured dictionary learning with group sparsity for human action recognition , 2015, Signal Process..

[22]  Meng Wang,et al.  Oracle in Image Search: A Content-Based Approach to Performance Prediction , 2012, TOIS.

[23]  Yanmin Qian,et al.  Very Deep Convolutional Neural Networks for Noise Robust Speech Recognition , 2016, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[24]  Cordelia Schmid,et al.  Human Detection Using Oriented Histograms of Flow and Appearance , 2006, ECCV.

[25]  Hironobu Fujiyoshi,et al.  Real-time human motion analysis by image skeletonization , 1998, Proceedings Fourth IEEE Workshop on Applications of Computer Vision. WACV'98 (Cat. No.98EX201).

[26]  Jürgen Schmidhuber,et al.  LSTM can Solve Hard Long Time Lag Problems , 1996, NIPS.

[27]  Tat-Seng Chua,et al.  Learning from Collective Intelligence , 2016, ACM Trans. Multim. Comput. Commun. Appl..

[28]  Jialie Shen,et al.  Forbidden City Great Wall Old SummerPalace Temple of Heaven Tiananmen Square Avenue of Stars Disneyland Resort Peninsular Hotel Tian Tan Budda Victoria Harbour Big Ben Buckingham Palace , 2016 .

[29]  Mohan S. Kankanhalli,et al.  Hierarchical Clustering Multi-Task Learning for Joint Human Action Grouping and Recognition , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[30]  Mohan S. Kankanhalli,et al.  Multi-stream Deep Learning Framework for Automated Presentation Assessment , 2016, 2016 IEEE International Symposium on Multimedia (ISM).

[31]  Andrew Zisserman,et al.  Two-Stream Convolutional Networks for Action Recognition in Videos , 2014, NIPS.

[32]  Harry W. Agius,et al.  Video summarisation: A conceptual framework and survey of the state of the art , 2008, J. Vis. Commun. Image Represent..

[33]  Yale Song,et al.  Video co-summarization: Video summarization by visual co-occurrence , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[34]  Cordelia Schmid,et al.  Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[35]  Ye Yuan,et al.  Encode, Review, and Decode: Reviewer Module for Caption Generation , 2016, ArXiv.

[36]  Matej Kristan,et al.  Histograms of optical flow for efficient representation of body motion , 2010, Pattern Recognit. Lett..

[37]  Ba Tu Truong,et al.  Video abstraction: A systematic review and classification , 2007, TOMCCAP.

[38]  Ming Gao,et al.  BiRank: Towards Ranking on Bipartite Graphs , 2017, IEEE Transactions on Knowledge and Data Engineering.

[39]  Quoc V. Le,et al.  Sequence to Sequence Learning with Neural Networks , 2014, NIPS.

[40]  Yoshua Bengio,et al.  Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation , 2014, EMNLP.

[41]  Ian D. Reid,et al.  Stable multi-target tracking in real-time surveillance video , 2011, CVPR 2011.

[42]  Li Fei-Fei,et al.  DenseCap: Fully Convolutional Localization Networks for Dense Captioning , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[43]  Yuting Su,et al.  Multiple/Single-View Human Action Recognition via Part-Induced Multitask Structural Learning , 2015, IEEE Transactions on Cybernetics.

[44]  Trevor Darrell,et al.  Sequence to Sequence -- Video to Text , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[45]  Kewei Tu,et al.  Joint Video and Text Parsing for Understanding Events and Answering Queries , 2013, IEEE MultiMedia.

[46]  Trevor Darrell,et al.  Long-term recurrent convolutional networks for visual recognition and description , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[47]  Jingfan Guo,et al.  Salient object detection in RGB-D image based on saliency fusion and propagation , 2015, ICIMCS '15.

[48]  Mohan S. Kankanhalli,et al.  Benchmarking a Multimodal and Multiview and Interactive Dataset for Human Action Recognition , 2017, IEEE Transactions on Cybernetics.

[49]  Bill Triggs,et al.  Histograms of oriented gradients for human detection , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).