Fine-grained Action Detection in Untrimmed Surveillance Videos

Spatiotemporal localization of activities in untrimmed surveillance videos is a challenging task, particularly when multiple activities occur simultaneously across different temporal and spatial scales. We tackle this problem with a cascaded region proposal and detection (CRPAD) framework that performs simultaneous frame-level action detection followed by tracking. We propose a frame-level spatial detection model that builds on recent advances in object detection, together with a temporal linking algorithm that models the temporal dynamics of the detected activities. We report results on the VIRAT dataset through the recent Activities in Extended Video (ActEV) challenge, part of the TRECVID competition [1, 2].
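To make the temporal linking step concrete, the sketch below greedily links per-frame detection boxes into action tubes by matching each growing tube to the unclaimed box with the highest IoU in the next frame. This is a simplified, hypothetical illustration: the IoU threshold, the greedy matching rule, and the function names are assumptions for exposition, not the exact algorithm of the CRPAD framework.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0

def link_detections(frames, iou_thresh=0.3):
    """Greedily link per-frame boxes into tubes (hypothetical sketch).

    `frames` is a list with one entry per frame, each a list of boxes.
    Returns a list of tubes; each tube is a list of (frame_idx, box).
    """
    tubes = []    # all tubes, finished and still growing
    active = []   # indices into `tubes` that can still be extended
    for t, boxes in enumerate(frames):
        used = set()
        next_active = []
        for ti in active:
            last_box = tubes[ti][-1][1]
            # match the tube's tail to the best unclaimed box in this frame
            best, best_iou = None, iou_thresh
            for bi, box in enumerate(boxes):
                if bi in used:
                    continue
                o = iou(last_box, box)
                if o >= best_iou:
                    best, best_iou = bi, o
            if best is not None:
                tubes[ti].append((t, boxes[best]))
                used.add(best)
                next_active.append(ti)
        # any unmatched box starts a new tube
        for bi, box in enumerate(boxes):
            if bi not in used:
                tubes.append([(t, box)])
                next_active.append(len(tubes) - 1)
        active = next_active
    return tubes
```

A real system would additionally score tubes (e.g. by accumulated detection confidence) and smooth the per-frame class labels over time, for instance with the Viterbi algorithm [8].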

[1]  Ali Farhadi et al., "You Only Look Once: Unified, Real-Time Object Detection," CVPR, 2016.

[2]  Mubarak Shah et al., "Spatiotemporal Deformable Part Models for Action Detection," CVPR, 2013.

[3]  Jonathan G. Fiscus et al., "TRECVID 2018: Benchmarking Video Activity Detection, Video Captioning and Matching, Video Storytelling Linking and Video Search," TRECVID, 2018.

[4]  Haroon Idrees et al., "Action Localization in Videos through Context Walk," ICCV, 2015.

[5]  Andrew Zisserman et al., "Very Deep Convolutional Networks for Large-Scale Image Recognition," ICLR, 2015.

[6]  Larry S. Davis et al., "AVSS 2011 demo session: A large-scale benchmark dataset for event recognition in surveillance video," AVSS, 2011.

[7]  Fei-Fei Li et al., "ImageNet: A large-scale hierarchical image database," CVPR, 2009.

[8]  G. D. Forney, Jr., "The Viterbi Algorithm," Proceedings of the IEEE, 1973.

[9]  Cordelia Schmid et al., "Learning to Track for Spatio-Temporal Action Localization," ICCV, 2015.

[10]  Stephen J. McKenna et al., "Combining embedded accelerometers with computer vision for recognizing food preparation activities," UbiComp, 2013.

[11]  Ivan Laptev et al., "Unsupervised Learning from Narrated Instruction Videos," CVPR, 2016.

[12]  C. V. Jawahar et al., "Unsupervised Learning of Deep Feature Representation for Clustering Egocentric Actions," IJCAI, 2017.

[13]  Ali Farhadi et al., "YOLO9000: Better, Faster, Stronger," CVPR, 2017.

[14]  Paul Over et al., "Evaluation campaigns and TRECVid," MIR '06, 2006.

[15]  Thomas Serre et al., "An end-to-end generative framework for video segmentation and recognition," WACV, 2016.

[16]  Luc Van Gool et al., "Actionness Estimation Using Hybrid Fully Convolutional Networks," CVPR, 2016.

[17]  Jitendra Malik et al., "Finding action tubes," CVPR, 2015.

[18]  Li Yao et al., "DT-3DResNet-LSTM: An Architecture for Temporal Activity Recognition in Videos," TRECVID, 2018.