Spatio-Temporal AutoEncoder for Video Anomaly Detection

Anomalous events detection in real-world video scenes is a challenging problem due to the complexity of "anomaly" as well as the cluttered backgrounds, objects and motions in the scenes. Most existing methods use hand-crafted features in local spatial regions to identify anomalies. In this paper, we propose a novel model called Spatio-Temporal AutoEncoder (ST AutoEncoder or STAE), which utilizes deep neural networks to learn video representation automatically and extracts features from both spatial and temporal dimensions by performing 3-dimensional convolutions. In addition to the reconstruction loss used in existing typical autoencoders, we introduce a weight-decreasing prediction loss for generating future frames, which enhances the motion feature learning in videos. Since most anomaly detection datasets are restricted to appearance anomalies or unnatural motion anomalies, we collected a new challenging dataset comprising a set of real-world traffic surveillance videos. Several experiments are performed on both the public benchmarks and our traffic dataset, which show that our proposed method remarkably outperforms the state-of-the-art approaches.

[1]  Limin Wang,et al.  Action recognition with trajectory-pooled deep-convolutional descriptors , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[2]  Sergey Ioffe,et al.  Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift , 2015, ICML.

[3]  Cordelia Schmid,et al.  Long-Term Temporal Convolutions for Action Recognition , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[4]  Kaiming He,et al.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[5]  Ming Yang,et al.  3D Convolutional Neural Networks for Human Action Recognition , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[6]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[7]  K. Grauman,et al.  Observe locally, infer globally: A space-time MRF for detecting abnormal activities with incremental updates , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[8]  Andrew L. Maas Rectifier Nonlinearities Improve Neural Network Acoustic Models , 2013 .

[9]  Nicu Sebe,et al.  A Prototype Learning Framework Using EMD: Application to Complex Scenes Analysis , 2013, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[10]  Nicu Sebe,et al.  Learning Deep Representations of Appearance and Motion for Anomalous Event Detection , 2015, BMVC.

[11]  Jin Young Choi,et al.  Motion Interaction Field for Accident Detection in Traffic Surveillance Video , 2014, 2014 22nd International Conference on Pattern Recognition.

[12]  L. Kratz,et al.  Anomaly detection in extremely crowded scenes using spatio-temporal motion pattern models , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[13]  Jonghyun Choi,et al.  Learning Temporal Regularity in Video Sequences , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[14]  Geoffrey E. Hinton,et al.  Rectified Linear Units Improve Restricted Boltzmann Machines , 2010, ICML.

[15]  Lorenzo Torresani,et al.  Learning Spatiotemporal Features with 3D Convolutional Networks , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).

[16]  Mubarak Shah,et al.  UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild , 2012, ArXiv.

[17]  Cewu Lu,et al.  Abnormal Event Detection at 150 FPS in MATLAB , 2013, 2013 IEEE International Conference on Computer Vision.

[18]  Fei-Fei Li,et al.  Large-Scale Video Classification with Convolutional Neural Networks , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[19]  Graham W. Taylor,et al.  Deconvolutional networks , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[20]  Gabriel Kreiman,et al.  Unsupervised Learning of Visual Structure using Predictive Generative Networks , 2015, ArXiv.

[21]  Antonio Torralba,et al.  Anticipating the future by watching unlabeled video , 2015, ArXiv.

[22]  Li Fei-Fei,et al.  Detecting Events and Key Actors in Multi-person Videos , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[23]  Luc Van Gool,et al.  Temporal Segment Networks: Towards Good Practices for Deep Action Recognition , 2016, ECCV.

[24]  Yann LeCun,et al.  Deep multi-scale video prediction beyond mean square error , 2015, ICLR.

[25]  Javier Sánchez Pérez,et al.  TV-L1 Optical Flow Estimation , 2013, Image Process. Line.

[26]  Nuno Vasconcelos,et al.  Anomaly detection in crowded scenes , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[27]  Rasmus Berg Palm,et al.  Prediction as a candidate for learning deep hierarchical models of data , 2012 .

[28]  Christophe Rosenberger,et al.  Abnormal events detection based on spatio-temporal co-occurences , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[29]  Venkatesh Saligrama,et al.  Video anomaly detection based on local statistical aggregates , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[30]  Mubarak Shah,et al.  Abnormal crowd behavior detection using social force model , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[31]  Junsong Yuan,et al.  Sparse reconstruction cost for abnormal event detection , 2011, CVPR 2011.

[32]  Gang Sun,et al.  A Key Volume Mining Deep Framework for Action Recognition , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[33]  Nitish Srivastava,et al.  Unsupervised Learning of Video Representations using LSTMs , 2015, ICML.

[34]  Lin Sun,et al.  Human Action Recognition Using Factorized Spatio-Temporal Convolutional Networks , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).