NTT_CQUPT@TRECVID2019 ActEV: Activities in Extended Video

In this notebook paper, we present our activity detection system, which aims to temporally localize activities in surveillance videos. Our pipeline is composed of five modules: object detection, activity proposal generation, feature extraction, classification, and post-processing. We feed RGB frames and optical flow into this pipeline separately and obtain frame-level predictions by late fusion. The final detections are generated by greedily merging these predictions and filtering out invalid results.

1. System Description

Activity detection in surveillance videos is a challenging task due to low resolution, occlusion of objects, and similarity between activities. To obtain reliable results, most previous participants decomposed the task into multiple subtasks [1, 2, 3]. Our system for activity detection in extended videos (ActEV) in TRECVID 2019 [4] is composed of five modules: object detection, activity proposal generation, feature extraction, classification, and post-processing. The diagram of the five modules in our system is shown in Figure 1. We evaluate and analyze each module separately in the following.

Object detection: locates and classifies objects and activities.
Activity proposal generation: generates candidate tubes by temporally tracking bounding boxes for activities. These tubes are called activity proposals.
Feature extraction: fine-tunes the backbone network and extracts features for activity proposals.
Classification: trains a classifier to classify activity proposals.
Post-processing: merges the activity proposals for activity localization.

Figure 1: System Overview. (Input → Object Detection → Proposal Generation → I3D Feature Extraction → Classification → Post-processing → Activity category and location.)
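The late-fusion and greedy-merging steps described above can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the fusion weight, score threshold, and minimum-length filter are assumed values chosen for the example.

```python
# Hypothetical sketch of late fusion of two-stream frame scores followed by
# greedy merging of consecutive above-threshold frames into temporal segments.
# w_rgb, score_thresh, and min_len are illustrative assumptions.

def late_fuse(rgb_scores, flow_scores, w_rgb=0.5):
    """Fuse per-frame activity scores from the RGB and optical-flow streams
    by a weighted average."""
    return [w_rgb * r + (1.0 - w_rgb) * f
            for r, f in zip(rgb_scores, flow_scores)]

def merge_detections(frame_scores, score_thresh=0.5, min_len=3):
    """Greedily merge runs of consecutive frames whose fused score exceeds
    the threshold into (start, end) segments, then filter out segments
    shorter than min_len frames (the "invalid results")."""
    segments, start = [], None
    for i, s in enumerate(frame_scores):
        if s >= score_thresh and start is None:
            start = i                       # open a new segment
        elif s < score_thresh and start is not None:
            if i - start >= min_len:        # keep only long-enough runs
                segments.append((start, i - 1))
            start = None
    if start is not None and len(frame_scores) - start >= min_len:
        segments.append((start, len(frame_scores) - 1))
    return segments
```

For example, fusing `[0.9, 0.9, 0.8, 0.7, 0.2]` (RGB) with `[0.7, 0.7, 0.6, 0.5, 0.2]` (flow) and merging yields a single segment covering the first four frames.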

[1] Ross B. Girshick et al. Mask R-CNN. 2017. arXiv:1703.06870.

[2] Jun-Cheng Chen et al. A Proposal-Based Solution to Spatio-Temporal Action Detection in Untrimmed Videos. 2019 IEEE Winter Conference on Applications of Computer Vision (WACV).

[3] Jonathan G. Fiscus et al. TRECVID 2019: An evaluation campaign to benchmark Video Activity Detection, Video Captioning and Matching, and Video Search & retrieval. TRECVID, 2019.

[4] Alexander G. Hauptmann et al. Minding the Gaps in a Video Action Analysis Pipeline. 2019 IEEE Winter Applications of Computer Vision Workshops (WACVW).

[5] Dit-Yan Yeung et al. Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting. NIPS, 2015.

[6] Chong-Wah Ngo et al. vireoJD-MM at Activity Detection in Extended Videos. arXiv, 2019.

[7] Andrew Zisserman et al. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[8] Dietrich Paulus et al. Simple online and realtime tracking with a deep association metric. 2017 IEEE International Conference on Image Processing (ICIP).

[9] Paul Over et al. Evaluation campaigns and TRECVid. MIR '06, 2006.

[10] Jian Sun et al. Deep Residual Learning for Image Recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).