MMVG-INF-Etrol@TRECVID 2019: Activities in Extended Video

We propose Argus, a video analysis system for detecting activities in surveillance scenarios, which won the TRECVID Activities in Extended Video (ActEV) challenge 2019. To detect and localize surveillance events in videos, Argus employs a spatial-temporal activity proposal generation module that draws on object detection and tracking, followed by a sequential classification module to spatially and temporally localize the persons and objects involved in each activity. We detail the design challenges and provide our insights and solutions in developing a state-of-the-art surveillance video analysis system.
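The two-stage structure described above (proposal generation from tracked objects, then classification of each proposal) can be illustrated with a minimal sketch. This is not the authors' implementation: the sliding-window scheme, the union-box cuboid, and every name here (`Proposal`, `generate_proposals`, `classify_proposals`, `window`, `stride`) are assumptions made for illustration, and the classifier is a caller-supplied placeholder rather than a trained model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


@dataclass
class Proposal:
    """Hypothetical spatial-temporal proposal: a box held over a frame range."""
    track_id: int
    start_frame: int
    end_frame: int  # inclusive
    box: Box        # union of the track's boxes inside the window


def union_box(boxes: List[Box]) -> Box:
    """Smallest box that encloses every box in the list."""
    xs1, ys1, xs2, ys2 = zip(*boxes)
    return (min(xs1), min(ys1), max(xs2), max(ys2))


def generate_proposals(
    tracks: Dict[int, Dict[int, Box]], window: int, stride: int
) -> List[Proposal]:
    """Slide a temporal window along each track (assumed scheme, not the paper's).

    `tracks` maps track_id -> {frame_index: box}, i.e. the output one would
    expect from an object detector followed by a tracker.
    """
    proposals: List[Proposal] = []
    for track_id, frames in tracks.items():
        if not frames:
            continue
        frame_ids = sorted(frames)
        first, last = frame_ids[0], frame_ids[-1]
        start = first
        while start <= last:
            end = min(start + window - 1, last)
            boxes = [frames[f] for f in frame_ids if start <= f <= end]
            if boxes:  # skip windows in which the track has no detections
                proposals.append(Proposal(track_id, start, end, union_box(boxes)))
            start += stride
    return proposals


def classify_proposals(
    proposals: List[Proposal], classifier: Callable[[Proposal], str]
) -> List[Tuple[Proposal, str]]:
    """Second stage: label every proposal with a caller-supplied classifier."""
    return [(p, classifier(p)) for p in proposals]
```

With a single four-frame track, `generate_proposals(tracks, window=2, stride=2)` yields two non-overlapping proposals, each carrying the frame range and enclosing box that a downstream classifier would crop from the video.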
