Paper information - MMVG-INF-Etrol@TRECVID 2019: Activities in Extended Video


We propose a video analysis system for detecting activities in surveillance scenarios, which won the TRECVID Activities in Extended Video (ActEV) challenge 2019. To detect and localize surveillance events in videos, the system, Argus, employs a spatial-temporal activity proposal generation module facilitating object detection and tracking, followed by a sequential classification module that spatially and temporally localizes the persons and objects involved in each activity. We detail the design challenges and share our insights and solutions from developing a state-of-the-art surveillance video analysis system.
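The two-stage structure described in the abstract (proposal generation from tracked objects, then classification of each proposal) can be illustrated with a minimal sketch. This is not the authors' implementation: every name here (`Proposal`, `generate_proposals`, `detect_activities`) is hypothetical, the sliding-window scheme and its `window` and `stride` values are assumptions made for illustration, and the classifier is a stand-in callable rather than a trained model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), hypothetical layout


@dataclass
class Proposal:
    """A spatial-temporal cube: a frame range plus one enclosing box."""
    track_id: int
    start: int  # first frame (inclusive)
    end: int    # last frame (inclusive)
    box: Box


def union_box(boxes: List[Box]) -> Box:
    """Smallest box enclosing all given boxes."""
    xs1, ys1, xs2, ys2 = zip(*boxes)
    return (min(xs1), min(ys1), max(xs2), max(ys2))


def generate_proposals(track_id: int, track: Dict[int, Box],
                       window: int = 4, stride: int = 2) -> List[Proposal]:
    """Slide a temporal window over one object track (frame -> box).

    Assumption: fixed-length windows with a fixed stride; the paper's
    actual proposal scheme may differ.
    """
    frames = sorted(track)
    if not frames:
        return []
    proposals = []
    for i in range(0, max(len(frames) - window, 0) + 1, stride):
        chunk = frames[i:i + window]
        box = union_box([track[f] for f in chunk])
        proposals.append(Proposal(track_id, chunk[0], chunk[-1], box))
    return proposals


def detect_activities(tracks: Dict[int, Dict[int, Box]],
                      classify: Callable[[Proposal], Tuple[str, float]],
                      threshold: float = 0.5):
    """Stage 1: proposals per track. Stage 2: keep confident classifications."""
    results = []
    for track_id, track in tracks.items():
        for proposal in generate_proposals(track_id, track):
            label, score = classify(proposal)
            if score >= threshold:
                results.append((proposal, label, score))
    return results
```

In a real system the tracks would come from an object detector and tracker, and `classify` would wrap a video classification model; here both are left to the caller so the sketch stays self-contained.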
