Trimmed Action Recognition, Dense-Captioning Events in Videos, and Spatio-temporal Action Localization with Focus on ActivityNet Challenge 2019

This notebook paper presents an overview and comparative analysis of our systems designed for the following three tasks in ActivityNet Challenge 2019: trimmed action recognition, dense-captioning events in videos, and spatio-temporal action localization.

[1]  Tao Mei,et al.  Exploring Visual Relationship for Image Captioning , 2018, ECCV.

[2]  Tao Mei,et al.  Incorporating Copying Mechanism in Image Captioning for Learning Novel Objects , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[3]  Ling-Yu Duan,et al.  Unified Spatio-Temporal Attention Networks for Action Recognition in Videos , 2019, IEEE Transactions on Multimedia.

[4]  Kaiming He,et al.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[5]  Fabio Viola,et al.  The Kinetics Human Action Video Dataset , 2017, ArXiv.

[6]  Max Welling,et al.  Semi-Supervised Classification with Graph Convolutional Networks , 2016, ICLR.

[7]  Fei-Fei Li,et al.  Large-Scale Video Classification with Convolutional Neural Networks , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[8]  Tao Mei,et al.  Jointly Localizing and Describing Events for Dense Video Captioning , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[9]  Lukasz Kaiser,et al.  Attention is All you Need , 2017, NIPS.

[10]  Tao Mei,et al.  To Create What You Tell: Generating Videos from Captions , 2017, ACM Multimedia.

[11]  Tao Mei,et al.  Boosting Image Captioning with Attributes , 2016, 2017 IEEE International Conference on Computer Vision (ICCV).

[12]  Tao Mei,et al.  Pointing Novel Objects in Image Captioning , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[13]  Tao Mei,et al.  MSR Asia MSM at ActivityNet Challenge 2017: Trimmed Action Recognition, Temporal Action Proposals and Dense-Captioning Events in Videos , 2017 .

[14]  François Chollet,et al.  Xception: Deep Learning with Depthwise Separable Convolutions , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[15]  Tao Mei,et al.  Action Recognition by Learning Deep Multi-Granular Spatio-Temporal Video Representation , 2016, ICMR.

[16]  Tao Mei,et al.  Seeing Bot , 2017, SIGIR.

[17]  Tao Mei,et al.  MSR Asia MSM at THUMOS Challenge 2015 , 2015 .

[18]  Pietro Perona,et al.  Microsoft COCO: Common Objects in Context , 2014, ECCV.

[19]  Tao Mei,et al.  Learning Spatio-Temporal Representation with Pseudo-3D Residual Networks , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[20]  Chong-Wah Ngo,et al.  Learning Spatio-Temporal Representation With Local and Global Diffusion , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[21]  Tao Mei,et al.  MSR Asia MSM at ActivityNet Challenge 2016 , 2017 .

[22]  Tao Mei,et al.  Temporal Deformable Convolutional Encoder-Decoder Networks for Video Captioning , 2019, AAAI.

[23]  Tao Mei,et al.  Deep Quantization: Encoding Convolutional Activations with Deep Generative Model , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[24]  Tao Mei,et al.  Jointly Modeling Embedding and Translation to Bridge Video and Language , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[25]  Yi Li,et al.  Deformable Convolutional Networks , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[26]  Lei Zhang,et al.  Bottom-Up and Top-Down Attention for Image Captioning and VQA , 2017, ArXiv.

[27]  Pratik Rane,et al.  Self-Critical Sequence Training for Image Captioning , 2018 .

[28]  Tao Mei,et al.  Gaussian Temporal Awareness Networks for Action Localization , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[29]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[30]  Tao Mei,et al.  Video Captioning with Transferred Semantic Attributes , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[31]  Samy Bengio,et al.  Show and tell: A neural image caption generator , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[32]  Tao Mei,et al.  Recurrent Tubelet Proposal and Recognition Networks for Action Detection , 2018, ECCV.